Solving the Quadratic 0-1 Problem

Schütz, G.¹, Pires, F. M.², Ruano, A. E.²,³
¹ Escola Superior de Tecnologia, Universidade do Algarve
² Unidade de Ciências Exactas e Humanas, Universidade do Algarve
³ Institute of Systems & Robotics


Abstract. Quadratic 0-1 programming is a discrete optimization problem with many important applications. Difficult graph problems can be formulated and solved as quadratic 0-1 programming problems.

This is an NP-hard combinatorial problem that is very difficult to solve, even when the dimension is small. Branch-and-bound algorithms are the most widely used methods for solving this sort of problem exactly.

In  this  paper,  based  on  an  efficient  sequential  branch-and-bound  algorithm  for  the 
unconstrained  quadratic  0-1  programming,  we  study  the  behaviour  of  its  parallel 
implementation  using  transputers  and  present  some  computational  results.  We  also 
analyse  the  workload  distribution  among  processors. 

Keywords:  Quadratic  0-1  programming,  Branch  and  Bound  Algorithms,  Parallel 
Numerical  Algorithms 


1  Introduction 

In this paper we deal with the unconstrained quadratic 0-1 program:

$$\min_{x \in \{0,1\}^n} f(x) = c^T x + \tfrac{1}{2}\, x^T Q x \qquad (1)$$

with $c \in \mathbb{R}^n$ and $Q \in \mathbb{R}^{n \times n}$.

The quadratic 0-1 program has many interesting applications: for instance, it is applied to financial analysis problems [6], CAD problems [4], circuit layout design, distributed computer networks and telecommunication networks [1]. Some difficult graph problems (such as the maximum clique problem) can also be formulated and solved as quadratic 0-1 programming problems.

As problem (1) has so many applications, it is worthwhile investing some effort in solving it. One way of solving this problem is to use a branch-and-bound algorithm, which splits the original problem into subproblems, building a binary search tree. Each new subproblem must be either solved or pruned, if we can prove that it cannot yield a better solution. The search for good pruning techniques has been a matter of research in recent years [3], [11].

Another crucial aspect, when solving this kind of problem, is the need to produce a "good" initial solution. Some heuristics have been proposed in recent years [3], but further investigation is needed in this field, namely on how to use parallel processing to obtain a good initial guess.

Finally, the search strategy is very important. Depth-first, breadth-first and heuristic search strategies have been proposed in the context of sequential algorithms. When dealing with parallel algorithms, the search strategy must be designed according to the topology and the number of processors in order to obtain a more efficient method.

The  main  purpose  of  this  work  is  to  parallelize,  on  transputers,  a  branch-and-bound 
algorithm  for  the  quadratic  0-1  problem,  and  study  its  behaviour.  In  section  2  we 
summarise  the  branch-and-bound  algorithm  and  the  heuristic  for  finding  the  initial 
solution.  Section  3  presents  the  main  ideas  behind  the  parallelization  of  the  described 
algorithm.  In  section  4  we  present  some  computational  results  using  1,  2,  4  and  8 
transputers. 


2  A  Sequential  Algorithm 


The solution $y^*$ of the continuous version of the quadratic problem (1), in which the variables are constrained to the unit hypercube, is also a solution of the linear program [9]:

$$\min_{0 \le y \le e} \; (\nabla f(y^*))^T y \qquad (2)$$

where $e$ is the vector of ones. This implies that variables whose partial derivatives have a fixed sign in the unit hypercube can be fixed to either 0 or 1 according to that sign.

In order to make calculations easier, problem (1) can be formulated equivalently [9] as

$$\min_{y \in \{0,1\}^n} f(y) = y^T A y \qquad (3)$$

where $a_{ii} = c_i + q_{ii}/2$ and, for $i \ne j$, $a_{ij} = q_{ij}/2$, $i, j = 1, \dots, n$.

Without  any  loss  of  generality  the  matrix  A  can  be  considered  to  be  symmetric. 
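The equivalence between the two formulations can be checked numerically on a small instance. The sketch below is only an illustration: it assumes the coefficient formulas given above, with c the linear-term vector and Q a symmetric quadratic-term matrix, and the function names and the 3-variable data are hypothetical.

```python
from itertools import product

def to_matrix_a(c, Q):
    """Return A with a_ii = c_i + q_ii/2 and a_ij = q_ij/2 for i != j."""
    n = len(c)
    return [[c[i] + Q[i][i] / 2 if i == j else Q[i][j] / 2
             for j in range(n)] for i in range(n)]

def f_original(c, Q, x):
    """Evaluate c^T x + (1/2) x^T Q x."""
    n = len(c)
    linear = sum(c[i] * x[i] for i in range(n))
    quadratic = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return linear + quadratic / 2

def f_matrix(A, y):
    """Evaluate y^T A y."""
    n = len(A)
    return sum(A[i][j] * y[i] * y[j] for i in range(n) for j in range(n))

# Hypothetical 3-variable instance (illustrative values only).
c = [1.0, -2.0, 0.5]
Q = [[2.0, -1.0, 0.0],
     [-1.0, 4.0, 3.0],
     [0.0, 3.0, -6.0]]
A = to_matrix_a(c, Q)

# The two forms agree on every binary point because x_i**2 == x_i.
agree = all(abs(f_original(c, Q, x) - f_matrix(A, x)) < 1e-12
            for x in product((0, 1), repeat=len(c)))
```

The agreement holds only on binary points; on the interior of the hypercube the two functions differ, since the linear term has been folded into the diagonal.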

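It is possible to show [8] that for problem (3) the minimum range of the partial derivatives is

$$m_i \le \frac{\partial f(y)}{\partial y_i} \le M_i \quad \text{for } i = 1, \dots, n$$

where

$$m_i = 2 \sum_{j=1,\, j \ne i}^{n} a_{ij}^{-} + a_{ii} \quad \text{and} \quad M_i = 2 \sum_{j=1,\, j \ne i}^{n} a_{ij}^{+} + a_{ii}$$

with $a_{ij}^{+} = \max\{0, a_{ij}\}$ and $a_{ij}^{-} = \min\{0, a_{ij}\}$.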

This provides an easy way of forcing variables:

a) $m_i > 0 \Rightarrow y_i^* = 0$;

b) $M_i < 0 \Rightarrow y_i^* = 1$.
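As an illustration of the two forcing rules, the following sketch computes the gradient range of every variable at the initial node (no variable fixed yet) and applies rules a) and b) once. It assumes a symmetric matrix A stored as a list of lists and off-diagonal sums over j different from i; the function names and the sample matrix are hypothetical.

```python
def gradient_ranges(A):
    """Return (m, M): lower and upper limits of each partial derivative."""
    n = len(A)
    m, M = [], []
    for i in range(n):
        negative = sum(min(0.0, A[i][j]) for j in range(n) if j != i)
        positive = sum(max(0.0, A[i][j]) for j in range(n) if j != i)
        m.append(2 * negative + A[i][i])
        M.append(2 * positive + A[i][i])
    return m, M

def force_variables(A):
    """Apply rule a) m_i > 0 -> y_i = 0 and rule b) M_i < 0 -> y_i = 1."""
    m, M = gradient_ranges(A)
    forced = {}
    for i in range(len(A)):
        if m[i] > 0:
            forced[i] = 0
        elif M[i] < 0:
            forced[i] = 1
    return forced

# Hypothetical symmetric matrix (illustrative values only).
A = [[5.0, -1.0, 1.0],
     [-1.0, -4.0, 1.0],
     [1.0, 1.0, -1.0]]
m, M = gradient_ranges(A)
forced = force_variables(A)
```

In this example the first variable is forced to 0, the second to 1, and the third stays free. A full implementation would repeat the test at every node, taking the already fixed variables into account.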

Hence, the gradient of the objective function characterises the difficulty of the problem: smaller trees are obtained when more variables can be forced at the initial node. As can be seen from the formulas for $m_i$ and $M_i$, special characteristics of the matrix A determine the number of forced variables; in particular, matrices with a large number of dominant diagonal elements yield smaller trees.

This process of forcing variables is repeated at every node of the tree, leading to a smaller range of the partial derivatives of the non-fixed variables.

When working with branch-and-bound algorithms the main concern is to reduce the potential size of the search tree, which, at first glance, might have as many as $2^n - 1$ nodes. Besides using efficient techniques to fix variables, it is also crucial to use good pruning rules. We chose to use a lower bound on the objective function as a pruning rule. If at a node the lower bound is worse than the incumbent minimum, then that branch must be pruned, as that subproblem can never lead to a better solution. We used a lower bound function that gives a close bound on the optimum value and is very easy to evaluate and to update. To obtain this lower bound on the function f, we used the fact that its best possible value corresponds to adding the rows with a negative contribution to the objective function. This lower bound is easily computed from the limits of the gradient interval. Let

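$F = \{i : y_i \text{ is fixed}\}$ and $\bar{F} = \{i : y_i \text{ is free}\}$. We computed the lower bound of the gradient interval for variable $y_i$, $lb_i$, as:

$$lb_i = \begin{cases} \Big( \sum_{j \in F,\, j \ne i} a_{ij} y_j + \sum_{j \in \bar{F},\, j \ne i} a_{ij}^{-} + \tfrac{1}{2} a_{ii} \Big)\, y_i, & i \in F \\[1ex] \sum_{j \in F,\, j \ne i} a_{ij} y_j + \sum_{j \in \bar{F},\, j \ne i} a_{ij}^{-} + \tfrac{1}{2} a_{ii}, & i \in \bar{F} \end{cases} \qquad (4)$$

and then the lower bound of the objective function given in (3) is easily computed, as it is equal to:

$$\sum_{i \in F} \Big( lb_i + \tfrac{1}{2} a_{ii} y_i \Big) + \sum_{i \in \bar{F}} \Big( lb_i + \tfrac{1}{2} a_{ii} \Big)^{-} \qquad (5)$$

where $(t)^{-} = \min\{0, t\}$.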


The  update  of  (4)  and  (5)  is  easily  and  quickly  done. 
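A lower bound of this kind can be sketched as follows. The code is an assumption-laden illustration, not the authors' implementation: for each variable it adds the diagonal entry, the exact contribution of the fixed variables and the negative off-diagonal entries of the free ones; a fixed variable contributes this sum times its value, and a free variable contributes only the negative part of the sum. The names `lower_bound`, `objective` and `brute_force_min` and the sample matrix are hypothetical; A is assumed symmetric.

```python
from itertools import product

def objective(A, y):
    """Evaluate y^T A y for a binary vector y."""
    n = len(A)
    return sum(A[i][j] * y[i] * y[j] for i in range(n) for j in range(n))

def lower_bound(A, fixed):
    """Bound y^T A y from below over all completions of `fixed` (index -> 0/1)."""
    n = len(A)
    bound = 0.0
    for i in range(n):
        # a_ii plus the most favourable contribution of the other variables.
        s = A[i][i]
        for j in range(n):
            if j == i:
                continue
            s += A[i][j] * fixed[j] if j in fixed else min(0.0, A[i][j])
        # Fixed variables contribute y_i * s, free ones at best min(0, s).
        bound += fixed[i] * s if i in fixed else min(0.0, s)
    return bound

def brute_force_min(A, fixed):
    """Exact minimum over all completions, used here only as a check."""
    n = len(A)
    free = [i for i in range(n) if i not in fixed]
    best = float("inf")
    for values in product((0, 1), repeat=len(free)):
        y = [fixed.get(i, 0) for i in range(n)]
        for i, v in zip(free, values):
            y[i] = v
        best = min(best, objective(A, y))
    return best

# Hypothetical symmetric matrix (illustrative values only).
A = [[5.0, -1.0, 1.0],
     [-1.0, -4.0, 1.0],
     [1.0, 1.0, -1.0]]
root_bound = lower_bound(A, {})
node_bound = lower_bound(A, {0: 0, 1: 1})
```

When every variable is fixed the bound coincides with the objective value, which is what allows a node to be pruned as soon as its bound is no better than the incumbent.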

Another important aspect of a branch-and-bound algorithm is the order in which new subproblems are generated, that is, the choice of the branching variable at each node. As in [9], we chose to select the variable which is least likely to be forced in subsequent levels of the tree. This leads to the rule of choosing the branching variable corresponding to the maximum of the values $\min\{-m_i, M_i\}$ over all the variables not yet fixed. This rule has the additional advantage of reducing the gradient range of the remaining free variables, which is favourable for fixing more variables. The value (0 or 1) assigned to the selected variable is the one that decreases the lower bound function the most.
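The branching rule can be written in a few lines. This is a minimal sketch that takes the gradient limits m and M of the current node as given; the function name and the sample numbers are hypothetical, and ties are broken by the order of the `free` list, which the text does not specify.

```python
def choose_branching_variable(m, M, free):
    """Pick the free index maximising min(-m_i, M_i).

    A large score means the gradient interval [m_i, M_i] straddles zero
    widely on both sides, so the variable is unlikely to be forced later.
    """
    return max(free, key=lambda i: min(-m[i], M[i]))

# Hypothetical gradient limits for three free variables.
m = [-6.0, -1.0, -3.0]
M = [2.0, 3.0, 4.0]
branch = choose_branching_variable(m, M, free=[0, 1, 2])
```

Here the scores are 2, 1 and 3, so the third variable (index 2) is selected.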

The starting point is also very important in order to obtain small trees. As a matter of fact, if the initial solution is near the optimum there is a high possibility of pruning branches earlier. There are good heuristics for finding an initial solution. Most of the time the solution obtained is close to the optimal one, if not the optimal solution itself. In this paper we used a heuristic [3] based on finding a point which is better than all its adjacent points. This point is called a local star minimum point.
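One simple way to reach a point that no adjacent point improves on is a one-flip descent, sketched below. This is an assumed stand-in for illustration and not necessarily the heuristic of [3]: adjacency is taken to mean differing in exactly one coordinate, the search stops when no single flip strictly decreases the objective, and the names and data are hypothetical. A is assumed symmetric.

```python
def objective(A, y):
    """Evaluate y^T A y for a binary vector y."""
    n = len(A)
    return sum(A[i][j] * y[i] * y[j] for i in range(n) for j in range(n))

def one_flip_descent(A, start):
    """Flip single coordinates while doing so strictly decreases y^T A y."""
    y = list(start)
    n = len(y)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            # Partial derivative with respect to y_i at the current point.
            g = A[i][i] + 2 * sum(A[i][j] * y[j] for j in range(n) if j != i)
            # Flipping 0 -> 1 changes the objective by +g, 1 -> 0 by -g.
            delta = g if y[i] == 0 else -g
            if delta < 0:
                y[i] = 1 - y[i]
                improved = True
    return y

# Hypothetical symmetric matrix (illustrative values only).
A = [[5.0, -1.0, 1.0],
     [-1.0, -4.0, 1.0],
     [1.0, 1.0, -1.0]]
point = one_flip_descent(A, [0, 0, 0])
```

The loop terminates because each accepted flip strictly decreases the objective and there are finitely many binary points.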


3  Parallel  Algorithm 


Different  implementations  of  branch-and-bound  algorithms  in  different  parallel 
architectures  (shared  memory  multiprocessor,  distributed  memory  multiprocessor  and 
vector  processors)  are  mentioned  in  the  literature  [2],  [4],  [11].  There  are  also 
references  [4],  [6]  to  the  most  common  anomalies  in  parallel  branch  and  bound 
algorithms,  as  the  behaviour  of  such  algorithms  is  unpredictable. 

A branch and bound tree implicitly enumerates all possible solutions. Branches of the search tree are independent subproblems, so they can be evaluated in parallel. A parallel branch and bound algorithm generally splits the tree into exactly as many subproblems as there are processors. Then, each processor works on its own subproblem for a specified number of nodes (Maxn) of the branch-and-bound tree. When one of the processors completes its search on the tree, an unsolved subproblem of another processor is split and assigned to that free processor. Processors also exchange information about new incumbents.

In this work, the entire problem is initially assigned to the root processor and the range of the gradient is used to fix all possible variables, as described above. Then the initial problem is split among all the processors. We choose the variable $x_j$ that is least likely to be fixed in subsequent levels (as in the sequential algorithm) and split the tree into two subproblems. In the 4-transputer case, we repeated this splitting step in processors 1 and 2 in order to send subproblems to processors 3 and 4, respectively. In the 8-transputer case, processors 1, 2, 3 and 4 also repeat the splitting step in order to send subproblems to the other processors.
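The initial distribution described above amounts to repeated binary splitting until there is one subproblem per processor. The sketch below simulates this sequentially under simplifying assumptions: the number of processors is a power of two, a subproblem is represented as a dictionary of fixed variables, and the branching variable is supplied by a hypothetical `choose_variable` callback (here simply the lowest free index, rather than the gradient-based rule of the paper).

```python
def split_subproblems(root, num_processors, choose_variable):
    """Split `root` (dict index -> 0/1) into one subproblem per processor."""
    subproblems = [dict(root)]
    while len(subproblems) < num_processors:
        next_level = []
        for sub in subproblems:
            var = choose_variable(sub)
            # Each subproblem is split on its own branching variable.
            next_level.append({**sub, var: 0})
            next_level.append({**sub, var: 1})
        subproblems = next_level
    return subproblems

def lowest_free_index(n):
    """Hypothetical selection rule: branch on the first variable not yet fixed."""
    def choose(sub):
        return min(i for i in range(n) if i not in sub)
    return choose

# Root node of a 5-variable problem where variable 0 was already forced to 0.
root = {0: 0}
parts = split_subproblems(root, 4, lowest_free_index(5))
```

With four processors the root is split twice, giving four disjoint subproblems that all inherit the variables forced at the root.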

After a specified number of nodes, each processor communicates with its neighbours, exchanges incumbent solutions and sends subproblems to the free processors in the same way. The algorithm stops when all processors are free, that is, when the search on the branch-and-bound tree is completed.

In this work we used two, four and eight 25 MHz INMOS transputers, on a TMB16 platform, PC hosted, with the link speed set at 20 Mbits/sec. The programming language was ANSI C.

The topologies employed with 4 and 8 transputers are shown in figs. 1 and 2.

Pi = Processor i, i = 0, 1, 2, 3
Fig. 1. Topology for a 4-transputer network

Pi = Processor i, i = 0, ..., 7
Fig. 2. Topology for an 8-transputer network

In the sequential branch and bound algorithm we used a depth-first strategy. In the parallel one, with these topologies, each processor searches its own subtree depth-first; at the same time, the processors together perform a breadth-first search of the tree, because right and left branches are searched simultaneously, as shown in fig. 3 for the 8-processor case.

Fig. 3. Search tree for an 8-transputer network

5  Computational  Results 

To test the efficiency of the discrete algorithm, we studied its behaviour on problems whose matrices were taken from the Harwell-Boeing Collection (available at http://gams.nist.gov/MatrixMarket), from different sets (Structural Engineering, Partial Differential Equations, Power Systems Networks). These matrices are symmetric, real and not diagonally dominant. The number of variables (n) and the number of non-zero elements of the off-diagonal triangular matrix (m) are given in Table 1.
| Matrix name | n | m |
|---|---|---|
| bcsstk02 | 66 | 2145 |
| bcsstk03 | 112 | 264 |
| bcsstm07 | 420 | 3416 |
| gr3030 | 900 | 3422 |
| nos1 | 237 | 390 |
| nos6 | 675 | 1290 |
| nos7 | 729 | 1944 |
| 494bus | 494 | 586 |

Table 1. Matrix dimensions


For each matrix, we randomly generated several different c vectors (the independent term of (3)). In problems gr3030, nos6 and nos7 the optimal solution was obtained at the initial node, without any search tree, for all the generated c vectors. For each of the other problems we obtained 10 different instances. The characteristics of these test problems are described in table 2.


| Problem set | Matrix name | Number of variables | Elements (1) >0 | Elements (1) <0 | Diagonal (2) >0, min. | Diagonal (2) >0, max. | Diagonal (2) <0, min. | Diagonal (2) <0, max. | Total number (3), min. | Total number (3), max. | Fixed variables (4), min. | Fixed variables (4), max. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | bcsstk02 | 66 | 987 | 1158 | 46 | 49 | 17 | 20 | 2211 | 2211 | 12 | 16 |
| 2 | bcsstk03 | 112 | 115 | 149 | 59 | 89 | 23 | 53 | 376 | 376 | 83 | 89 |
| 3 | nos1 | 237 | 156 | 234 | 132 | 141 | 81 | 83 | 606 | 614 | 170 | 185 |
| 4 | bcsstm07 | 420 | 2146 | 1270 | 256 | 257 | 163 | 164 | 3836 | 3836 | 382 | 394 |
| 5 | 494bus | 494 | 0 | 586 | 286 | 310 | 184 | 208 | 1080 | 1080 | 438 | 463 |

(1) = number of non-zero elements of the off-diagonal triangular matrix

(2) = number of non-zero elements of the diagonal plus the independent term

(3) = number of non-zero elements of the triangular matrix

(4) = number of variables fixed at the initial node

Table 2. Characteristics of the test problems


As mentioned before, to improve the efficiency of the branch-and-bound algorithm it is necessary to start with a "good" guess. The heuristic described above performs this task with good results. Most of the time the initial guess obtained by the heuristic is actually the optimal solution, and the branch-and-bound algorithm is only needed to confirm its optimality. The time spent on the heuristic was not included in our results because it was always less than 0.1 seconds and it is executed sequentially.

We performed some preliminary tests to determine how granularity affects speedup. As far as the number of nodes of the search tree is concerned, the smaller Maxn is, the better the results are. As far as speedup is concerned, on the other hand, the bigger the tree is, the greater Maxn should be, and vice versa. Nevertheless, the speedup values did not vary meaningfully with Maxn. Although a frequent exchange of information between processors reduces the size of the search tree, it is more time consuming, and for bigger trees speedup becomes worse, as the reduction of the tree does not compensate for the increase in communication time. So, since we have an estimate of the tree size from the sequential branch and bound algorithm, we decided to choose the values of Maxn accordingly, as shown in table 3.


| Number of nodes (1) | 2 processors | 4 processors | 8 processors |
|---|---|---|---|
| < 1000 | 50 | 25 | 15 |
| > 1000 | 100 | 50 | 25 |

(1) = Performed by the sequential branch and bound algorithm

Table 3. Values of Maxn
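The choice of Maxn in Table 3 is a plain lookup on the sequential tree size and the number of processors. The sketch below encodes it; the function name is hypothetical, and because the table does not say what happens for exactly 1000 nodes, the sketch assumes that case falls in the larger class.

```python
# Maxn values from Table 3: processors -> (small trees, large trees).
MAXN_TABLE = {2: (50, 100), 4: (25, 50), 8: (15, 25)}

def choose_maxn(sequential_nodes, processors):
    """Return Maxn for a tree of the given sequential size.

    Assumption: a tree with exactly 1000 nodes is treated as a large tree.
    """
    small, large = MAXN_TABLE[processors]
    return small if sequential_nodes < 1000 else large
```

For example, `choose_maxn(500, 4)` selects 25 and `choose_maxn(5000, 2)` selects 100.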

We began our computational study by solving the test problems sequentially. Afterwards we applied the parallel version of the algorithm with 2, 4 and 8 transputers. Table 4 summarises the results obtained.

Table 4. Summary of the results with 1, 2, 4 and 8 transputers

In this table the "best" and the "worst" results refer, in the sequential case, to the best and worst execution time and, in the parallel case, to the best and worst speedup for 2, 4 or 8 processors, so we are not always talking about the same problem. That is why we focus our attention on the average line. We observe that the first three sets of problems behave similarly, with good values of speedup and efficiency. Problem sets 4 and 5 behave in opposite ways. Problem set 4 has very good values of speedup and efficiency, reaching superlinear speedup in some cases (1, 3 and 2 cases with 2, 4 and 8 processors, respectively). Problem set 5 has poor results, with detrimental speedups in all cases. These results confirm the unpredictable behaviour of branch and bound algorithms.

The unusual behaviour of problem sets 4 and 5 is due to the different ways of searching the tree in the sequential and parallel versions. In the sequential version, with a depth-first strategy, the right branch of the tree is explored only after the left one is finished, while in the parallel version, with the topology used, both branches are explored simultaneously. The good behaviour of the parallel algorithm in problem set 4 can be explained by the fact that the number of nodes of the branch-and-bound tree in the parallel version was significantly smaller than in the sequential version. As the value of the incumbent solution is updated frequently, the tree can be pruned earlier, mainly in the left branch. So the parallel algorithm usually needs to search fewer nodes than the sequential one, providing very good or superlinear speedups in the cases where the sequential algorithm stops only in the middle or at the end of the right branch search. In set 5, on the other hand, the sequential algorithm obtains the optimum at the end of the left branch search and consequently prunes the right branch of the tree in its first iterations, while the parallel algorithm decreases and updates the incumbent value slowly, and consequently searches many more nodes in this set of problems, mainly in the right branch.

One reason for this behaviour may have to do with the number of positive and negative elements of the off-diagonal triangular matrix of these two sets. The bcsstm07 matrix has almost twice as many positive elements as negative ones, while the 494bus matrix has no positive elements (table 2).

Figs.  4  to  8  show  the  mean  speedup  obtained,  for  each  problem  set.  In  these  graphics 
the  dashed  lines  represent  linear  speedup. 


Fig. 4. Speedup for set 1 (N = 66, m = 2211; horizontal axis: number of processors)

Fig. 5. Speedup for set 2 (N = 112, m = 376; horizontal axis: number of processors)

Fig. 6. Speedup for set 3

Fig. 7. Speedup for set 4

Fig. 8. Speedup for set 5

It is clear that the speedup curves are similar in figs. 4, 5 and 6. In fig. 7 speedup is almost linear. In fig. 8 speedup never gets far from 1.

As can be seen in the efficiency columns and in figs. 4 to 8, the parallel algorithm performs best with 2 processors and worst with 8 processors. This happens because, with more processors, the decrease in the number of searched nodes does not compensate for the increase in communication time.

We can separate the problems that behave in the same way for any number of processors into two groups. One contains 16 problems which achieved good efficiencies (greater than or equal to 0.6) and the other contains 12 problems which achieved bad efficiencies (less than 0.6). For these two groups we plotted, in fig. 9, the average number of nodes searched sequentially and in parallel with 2, 4 and 8 processors. This graph confirms what was explained above about the reasons for superlinear and detrimental speedups.
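The quantities used in this discussion can be computed as in the sketch below. It assumes the usual definitions, which the text does not state explicitly: speedup is the sequential time divided by the parallel time, and efficiency is speedup divided by the number of processors. The 0.6 threshold comes from the text; the function names and the timings are hypothetical.

```python
def speedup(t_sequential, t_parallel):
    """Sequential execution time divided by parallel execution time."""
    return t_sequential / t_parallel

def classify(t_sequential, t_parallel, processors, threshold=0.6):
    """Return speedup, efficiency and the good/superlinear flags."""
    s = speedup(t_sequential, t_parallel)
    e = s / processors
    return {
        "speedup": s,
        "efficiency": e,
        "good": e >= threshold,          # threshold 0.6 as used in the text
        "superlinear": s > processors,   # faster than the ideal linear case
    }

# Hypothetical timings in seconds (not measurements from the paper).
typical = classify(120.0, 40.0, processors=4)
lucky = classify(100.0, 20.0, processors=4)
```

With these made-up timings the first run has efficiency 0.75 and counts as good, while the second has a speedup of 5 on 4 processors and is therefore superlinear.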


Fig. 9. Efficiency versus number of searched nodes (horizontal axis: number of processors)


We can also use these two groups to study the workload distribution among processors. The average number of nodes searched by each processor is plotted in figs. 10 and 11, respectively. In fig. 12 we show the average values when considering all the problems.

Fig. 10. Average number of nodes searched in each processor for problems with good efficiencies

Fig. 11. Average number of nodes searched in each processor for problems with bad efficiencies

Fig. 12. Average number of nodes searched in each processor for all the problems


It is clear from these graphs that the workload is well balanced among processors in almost all cases. In spite of a slight irregularity visible in fig. 11, there are no significant differences between them.

6  Conclusion 

In this paper we present a study of the behaviour of a parallel algorithm for branch-and-bound search trees, implemented on a transputer network. The results obtained show that, generally, the use of parallel processing for this kind of algorithm is worthwhile, as significant savings in execution time can be obtained. Nevertheless, we must be aware that parallel branch and bound can produce poor as well as superlinear speedups.

The topology used appeared well suited to this kind of algorithm, as it enables a concurrent depth and breadth tree search and achieved a well-balanced workload among processors.

We are aware that transputers are out of date; nevertheless, we think that this study remains valid for other architectures with the same ratio between processing capacity and communication speed.

In  future,  we  should  consider  the  use  of  different  topologies,  a  larger  number  of 
processors,  and  different  parallel  machines,  in  order  to  solve  large-scale  problems  in  a 
reasonable  time. 
Abstract. This paper presents a genetic algorithm that addresses the real-time static allocation problem. In real-time systems, each task has its own timing constraints. A correct allocation is one in which the tasks on each processor are schedulable (no deadline is missed). Our algorithm considers both scheduling and allocation, and one major contribution of this work is that it relies on a direct problem representation and on advanced operators. Here, the problem representation clearly expresses the tasks' schedules and allows them to be manipulated directly. We define new genetic operators that help to choose between tasks that miss their deadlines and to decide where to move them in order to get better allocations. A parallel implementation of the algorithm is presented, as well as a comparison with the simulated annealing algorithm. The results obtained by this algorithm are promising and are presented at the end of this paper.


Keywords:  real-time  systems,  task  scheduling,  static  allocation,  parallel 
genetic  algorithm. 


1  Introduction 

Real-time scheduling is a field in which tasks have to be scheduled so as to respect timing constraints, precedence relations and resource constraints. Usually each task is described by a start time, a computing time and a deadline that must be met. Tasks can be either periodic or aperiodic, and may communicate and use resources. Most people use the term scheduling for both the monoprocessor and the multiprocessor case. In our work, scheduling is the way to arrange a set of tasks on a single processor. We use the term allocation to describe the way in which tasks are assigned to processors.

Task allocation, processor scheduling and communications scheduling are all NP-hard problems [4]. This leads to the view that they should be considered separately, and many researchers have handled each of these problems independently of the others. Most related work on task allocation has concentrated on maximizing the system throughput or reducing the application response time [5], [9], [8], [17], [13], [22]. These algorithms cannot be applied to real-time tasks because they rely on randomness. For each allocation obtained, these algorithms apply a post-allocation phase to determine the tasks' schedules on the processors (a task schedule is the order of execution of the tasks) and to check that the constraints are respected.

Among approaches that address the real-time static allocation problem, we can distinguish constructive approaches (building correct allocations) and iterative ones (modifying the current allocation in order to get a better one). Peng and Shin [18] solve the problem using two Branch&Bound algorithms (one for task allocation and another for task scheduling) for a set of communicating tasks. Hou and Shin [12] extend this work to duplicated tasks. Because of the intractability of the problem and the high cost of optimal approaches, heuristic algorithms have been developed (see the work of Ramamritham in [19], Davari and Dhall in [10], Burns et al. in [21], [2] and Cheng and Agrawala in [6]).

Holland developed genetic algorithms (GA) in 1975. GA are stochastic and iterative search algorithms based on the adaptive process of natural systems. They rely on the selection and the survival of the best adapted species. GA are characterized by individuals representing different allocations, a set of genetic operators used to create new individuals and a cost function to evaluate the individuals. GA have been successfully applied to a wide range of optimization problems, but only a few approaches have tried to apply them to the real-time task allocation problem. In [15], Kidwell presents a GA to schedule communicating tasks that can be extended to real-time scheduling. However, the problem representation used does not include any consideration for scheduling. In [14], a GA is applied to tasks with precedence constraints; in [11], a GA is proposed for the job-shop scheduling problem; and in [3], a GA is applied to production scheduling problems. All these works are close to ours, but none of the algorithms described above can be applied or adapted to the considered model. In summary, this problem has yet to be adequately addressed.

In our work we develop a genetic algorithm which addresses both allocation and scheduling. We adopt a representation that reflects the nature of the problem to solve. Indeed, our representation clearly shows the tasks' schedules on each processor. Besides, we develop new genetic operators specifically adapted to this new representation. When standard operators are applied randomly, task scheduling is not considered in the allocation algorithm, and the next step can lead to an allocation with more unschedulable tasks, since nothing guides the way tasks are moved between processors.

The remainder of this paper is organized as follows. In section 2, we introduce the task model addressed by our algorithm. Section 3 presents the representation adopted for this problem and a method for generating the initial population. Sections 4 and 5 describe the main characteristics of our genetic algorithm: the cost function that evaluates the allocations and the proposed operators. Section 6 presents implementation results of the parallel genetic algorithm and a comparison with the simulated annealing algorithm.


2  The  System  and  Task  Model 

In our model, tasks have timing constraints and fault-tolerance constraints, and they communicate.

Timing constraints: Each task Tj is described by three parameters Rj, Ej, Dj, where Rj is the time at which Tj is ready and can begin its execution¹, Ej its computing time and Dj its deadline.

Tasks allocated to a processor must be schedulable. This can be verified with the feasibility test of the Earliest Deadline First (EDF) approach² [7]. In this problem the feasibility test is insufficient: the algorithm must schedule the tasks in order to calculate the start time Sj at which task Tj is scheduled (which depends on the communications with Tj) and its completion time Cj = Sj + Ej.

Fault-tolerance constraints: In order to meet fault-tolerance requirements, some tasks are duplicated. When a task needs replicas, each instance must be allocated to a different processor. This way, if a processor fails, the execution of the task is guaranteed on another processor.

Communications: We suppose that communications take place after tasks finish their computations. When calculating the start time of a task Tj, the algorithm first evaluates the delay due to communications with Tj. Indeed, the earliest date at which Tj can be scheduled depends on the reception date of all messages sent to it (which themselves depend on the completion times of the sending tasks).
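The timing rules of this section can be illustrated with a small sketch that walks through the ordered task list of one processor. It rests on assumptions not stated in the text: execution is non-preemptive, a task starts at the latest of its ready time Rj, the completion of the previous task on the processor and the reception date of its last incoming message, and that reception date is passed in directly through the hypothetical `msg_ready` argument instead of being derived from the senders. Task names and numbers are made up.

```python
def schedule_processor(order, tasks, msg_ready=None):
    """Compute (Sj, Cj, deadline_met) for each task in execution order.

    order: task names in the order they run on the processor
    tasks: name -> (R, E, D) ready time, computing time, deadline
    msg_ready: name -> reception date of the last message sent to the task
    """
    msg_ready = msg_ready or {}
    processor_free = 0
    result = {}
    for name in order:
        R, E, D = tasks[name]
        S = max(processor_free, R, msg_ready.get(name, 0))
        C = S + E  # Cj = Sj + Ej
        result[name] = (S, C, C <= D)
        processor_free = C
    return result

# Hypothetical task set: name -> (R, E, D).
tasks = {"T1": (0, 3, 5), "T3": (2, 2, 6), "T5": (4, 4, 9)}
order = ["T1", "T3", "T5"]
no_messages = schedule_processor(order, tasks)
late_message = schedule_processor(order, tasks, msg_ready={"T5": 6})
```

Without messages all three deadlines are met; when the last message for T5 arrives at time 6, T5 starts at 6, completes at 10 and misses its deadline of 9.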


3  The  Problem  Representation  in  the  Genetic  Algorithm 


A genetic algorithm consists of four steps. The generation step randomly creates a population of individuals, each of which is a potential allocation. A cost function is then applied to evaluate them. The values obtained determine which individuals will be selected for the reproduction step. Applying genetic operators, the best known of which are mutation and crossover, generates new individuals. The evaluation, selection and reproduction steps are repeated until the algorithm converges.

GAs are known to converge well when the following conditions
are respected: (1) the coding of the individuals correctly reflects the problem to
solve, (2) the individuals are in a one-to-one correspondence with search nodes (i.e.
each individual corresponds to a legal search node or an allocation). Usually each

1  We  assume  that  tasks  have  no  precedence  constraints,  but  a  programmer  can  express  such 
constraints  through  Rj. 

2  EDF  is  a  dynamic  scheduling  algorithm  that  can  be  applied  to  either  periodic  or  aperiodic 
tasks.  The  feasibility  test  of  EDF  for  a  set  of  tasks  is  that  the  sum  of  the  utilization  factors 
of  the  tasks  be  less  than  or  equal  to  one. 
individual is composed of genes, each one corresponding to a task, and the value of
the gene indicates the processor number on which it is placed. The representation
used in this paper is based on the list schedules of tasks on each processor. An
individual reflects, for each task, on which processor it is placed and at which
position. Hence, an individual is composed of strings corresponding to the different
processors, and each string shows the order in which tasks will be executed on the
processor. We define a correct individual as one where all tasks are represented
(completeness) and only once (uniqueness). A correct individual is presented in the
following figure. On processor P1, 4 tasks are allocated and scheduled {T1, T2, T3,
T5}. On processor P2, task T6 is scheduled before T4.

P1  T1 | T3 | T5 | T2 |

P2  T6 | T4 |

Fig.  1.  A  problem  representation  with  scheduling  considerations 
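The representation and the correctness conditions (completeness and uniqueness) can be sketched in a few lines of Python. This is only an illustration: the names `Individual`, `is_correct` and `locate` are hypothetical and do not come from the paper, and tasks are assumed to be numbered 1..N.

```python
from typing import List, Tuple

# One ordered task list ("string") per processor; index 0 stands for P1.
Individual = List[List[int]]


def is_correct(individual: Individual, n_tasks: int) -> bool:
    """True if every task 1..n_tasks appears exactly once
    (completeness and uniqueness)."""
    placed = sorted(task for string in individual for task in string)
    return placed == list(range(1, n_tasks + 1))


def locate(individual: Individual, task: int) -> Tuple[int, int]:
    """Return (processor index, position in its string) of a task."""
    for proc, string in enumerate(individual):
        if task in string:
            return proc, string.index(task)
    raise ValueError(f"task {task} is not allocated")


# The individual of Fig. 1: P1 runs T1, T3, T5, T2; P2 runs T6, T4.
fig1 = [[1, 3, 5, 2], [6, 4]]
```

A flat gene vector (task to processor number) would lose the execution order; keeping one list per processor stores both the placement and the position of each task.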


3.1  Generating  the  Initial  Population 

To generate individuals with correct schedules, we have at least to generate
strings where tasks are ordered in ascending order of deadlines. For this purpose,
the set T of the N tasks to allocate is divided into classes according to deadline values.

Class 0 contains all tasks that have their deadlines in the interval [0, l], where l
is the smallest deadline in T. Class 1 contains tasks whose deadlines are in ]l, l+l'],
class 2 contains tasks whose deadlines are in ]l+l', l+2l'], and so on. l' is a function
of the deadline dispersion. The algorithm used to generate the initial population is
described in the following:

Algorithm Generate-Population

GP1. [Initialize] determine l and choose a value for l'

GP2. [Form classes] separate the tasks according to
their deadlines and form the classes

GP3. [Repeat GP4 for each Pj, j varies from 1 to M-1]

GP4. [Allocate tasks] Repeat for each class clk:
randomly generate a number nb_tasks³ of tasks to pick
from class clk and allocate them to Pj.

GP5. [Allocate the remaining tasks] assign the
remaining tasks in each class to the last processor PM


3  nb_tasks must be less than cardinal(clk)/M + 1 so that all the processors can be
served.
Individual I1 {1, 2, 2, 1}

P1  T1 | T2 | T3 | T6 | T7 | T8 |

P2  T5 | T4 |

Individual I2 {1, 1, 1, 1}

P1  T1 | T2 | T6 | T8 |

P2  T3 | T5 | T7 | T4 |

Individual I3 {0, 1, 2, 1}

P1  T3 | T6 | T7 | T8 |

P2  T1 | T2 | T5 | T4 |

Class(0) = {T1}   Class(1) = {T2, T3, T5}   Class(2) = {T6, T7}   Class(3) = {T8, T4}

Fig. 2. An example of population generation for 8 tasks to place on 2 processors. Here are
3 examples of individuals. The notation Individual I1 {1, 2, 2, 1} means that nb_tasks is
respectively 1 for class 0, 2 for class 1, ... and 1 for class 3. l and l' are both equal to 3.
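The steps GP1 to GP5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the bound on nb_tasks is read as "at most card(clk) // M + 1" (one possible reading of footnote 3), and the deadlines in the example are invented so that they reproduce the classes of Fig. 2 with l = l' = 3.

```python
import math
import random


def form_classes(deadlines, l_prime):
    """GP1/GP2: group task ids into deadline classes.

    Class 0 holds deadlines in [0, l]; class k >= 1 holds deadlines in
    ]l + (k-1)*l', l + k*l'], where l is the smallest deadline.
    """
    l = min(deadlines.values())
    classes = {}
    for task, d in deadlines.items():
        k = 0 if d <= l else math.ceil((d - l) / l_prime)
        classes.setdefault(k, []).append(task)
    return [classes[k] for k in sorted(classes)]


def generate_individual(classes, n_procs, rng):
    """GP3-GP5: build one individual (one task string per processor)."""
    remaining = [list(cl) for cl in classes]
    strings = []
    for _ in range(n_procs - 1):                 # GP3: processors P1..P(M-1)
        string = []
        for k, cl in enumerate(classes):         # GP4: classes in ascending order
            # Assumed reading of footnote 3: at most card(cl) // M + 1 tasks.
            limit = min(len(remaining[k]), len(cl) // n_procs + 1)
            picked = rng.sample(remaining[k], rng.randint(0, limit))
            for task in picked:
                remaining[k].remove(task)
            string.extend(picked)
        strings.append(string)
    # GP5: the last processor receives what is left, class by class.
    strings.append([task for rem in remaining for task in rem])
    return strings


# Hypothetical deadlines chosen to reproduce the classes of Fig. 2.
example_deadlines = {1: 3, 2: 4, 3: 5, 5: 6, 6: 7, 7: 9, 8: 10, 4: 12}
example_classes = form_classes(example_deadlines, l_prime=3)
example_individual = generate_individual(example_classes, 2, random.Random(0))
```

Because each string is filled class by class, every generated string respects the ascending order of classes, and every task is placed exactly once.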

4 The Cost Function

The cost function of such a problem is complex and has to respect the following
characteristics of an allocation:

(1)  Tasks  meet  their  deadlines 

(2)  Replicas  are  allocated  to  different  processors 

These are hard constraints. In addition, two other criteria are considered:

(3)  Minimize  the  communication  costs 

(4)  Minimize  the  response  time  of  the  application 

f(A)  =  penalty_sched(A)  +  penalty_replica(A)  +  com_cost(A)  +  schedule_length(A) 

An allocation is correct if characteristics (1) and (2) are respected. We define
penalty_f(A) as the sum of the first two terms of f(A). It is equal to zero when a
correct allocation is reached. Our purpose is to find a good correct solution, so we try
to reduce communication costs, which are the major handicap of the target machines
(parallel and distributed machines with distributed memories), and the length of the
allocation. Characteristics (3) and (4) help in choosing between correct allocations;
hence they are considered only once penalty_f(A) is equal to zero.
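One way to express "criteria (3) and (4) only count once penalty_f(A) is zero" is a two-level comparison key. The sketch below is an assumption about how the four terms could be combined, not the paper's code; `cost_key` is a hypothetical name, and the four arguments are the already computed terms of f(A).

```python
from typing import Tuple


def cost_key(penalty_sched: float, penalty_replica: float,
             com_cost: float, schedule_length: float) -> Tuple[float, float]:
    """Comparison key: hard-constraint penalty first, quality second."""
    penalty_f = penalty_sched + penalty_replica
    if penalty_f > 0:
        # Incorrect allocation: only the hard-constraint penalty matters.
        return (penalty_f, 0.0)
    # Correct allocation: rank by communication cost plus schedule length.
    return (0.0, com_cost + schedule_length)
```

Python compares tuples lexicographically, so any correct allocation ranks before any incorrect one, and correct allocations are ordered among themselves by com_cost(A) + schedule_length(A).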
Calculating the penalty due to replicas: Duplicated tasks are indicated in a matrix
where each row gives the index of a task and its replica. This component counts, over
the M processors, the pairs of replicas placed on the same processor; nb_T(P) is the
number of tasks on a processor P.

\[ penalty\_replica(A) = \sum_{k=1}^{M} \sum_{i=1}^{nb\_T(P_k)} \sum_{j=i+1}^{nb\_T(P_k)} replica(i, j) \]

replica(i, j) returns 1 if Tj is a replica of Ti and 0 otherwise.
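The triple sum simply counts, on each processor, the pairs of tasks that are replicas of each other. A minimal sketch, assuming the replica matrix is given as a list of (task, replica) pairs (the name `replica_pairs` is hypothetical):

```python
from itertools import combinations


def penalty_replica(individual, replica_pairs):
    """Count pairs of replicas that share a processor."""
    pairs = {frozenset(pair) for pair in replica_pairs}
    return sum(
        1
        for string in individual              # sum over the M processors
        for a, b in combinations(string, 2)   # all i < j on that processor
        if frozenset((a, b)) in pairs         # replica(i, j) == 1
    )
```

For example, with replica pairs (T1, T3) and (T2, T5), the individual [[1, 2, 3], [4, 5]] has a penalty of 1 because T1 and T3 share P1.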

Calculating the communication costs: these costs depend on the amount of
information to exchange between two tasks and on the distance separating the
associated processors.

-  q(Ti, Tj) is the amount of information exchanged between Ti and Tj.

-  d(Pi, Pj) is the distance between Pi and Pj, defined as the number of processors
in a path from Pi to Pj minus 1.

-  com_cost(Ti, Tj) is the cost of the communications between Ti and Tj.
com_cost(Ti, Tj) = q(Ti, Tj) * d(Pi, Pj) when Ti is placed on Pi and Tj on Pj.

-  com_cost(Pi, Pj) is the sum of all the com_cost(Ti, Tj) for all communicating
tasks placed on the two processors.

\[ com\_cost(A) = \sum_{j=1}^{M} \sum_{i=j+1}^{M} com\_cost(P_j, P_i) \]
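Summing com_cost(Pj, Pi) over all processor pairs amounts to summing q * d over all communicating task pairs. The sketch below assumes, for illustration only, that `q` is a dictionary listing each communicating pair once and that `dist` is a precomputed hop-count matrix; both names are hypothetical.

```python
def com_cost(individual, q, dist):
    """Total communication cost of an allocation.

    q maps a pair of tasks (ti, tj) to the amount of information they
    exchange; dist[a][b] is the distance between processors a and b.
    """
    proc_of = {task: p for p, string in enumerate(individual) for task in string}
    total = 0
    for (ti, tj), amount in q.items():
        # Tasks on the same processor are at distance 0 and cost nothing.
        total += amount * dist[proc_of[ti]][proc_of[tj]]
    return total
```

With three processors in a line (distances 0, 1, 2), the allocation [[1, 2], [3], [4]] and q = {(1, 3): 5, (2, 4): 2, (1, 2): 7} give 5*1 + 2*2 + 7*0 = 9.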

Reducing these costs is equivalent to reducing the distance separating Pi and Pj.
In that case, the network of processors used to calculate the distances is a logical
one. Several algorithms exist for the projection of a logical processor network onto a
physical one. A state of the art is presented in [20].

When computing start and completion times, the algorithm takes communication
delays into account in the start times of receiver tasks. Let com(Tk) be
the set of tasks sending a message to Tk. To calculate the start time of Tk, we
compute for each task Tj in com(Tk) the delay necessary for the transmission of its
message. Tk cannot be scheduled before its ready time is reached or before it has
received all the messages.

\[ S_k = \max\{ R_k ;\ \max\{ C_j + com\_cost(T_k, T_j) \}\ \forall T_j \in com(T_k) \} \]
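The start-time rule can be sketched as a small list scheduler. This is an illustration under assumptions that go beyond the formula: a task also waits for the previous task of its own string to complete (each processor runs one task at a time), and `senders` and `delay` are hypothetical names for com(Tk) and com_cost(Tk, Tj).

```python
def schedule(individual, ready, exec_time, senders, delay):
    """Compute start and completion times of all tasks.

    senders[t] lists the tasks that send a message to t;
    delay(t, s) is the communication delay from sender s to receiver t.
    """
    start, completion = {}, {}
    next_idx = [0] * len(individual)    # next task to schedule on each string
    free_at = [0] * len(individual)     # time at which each processor is free
    pending = sum(len(string) for string in individual)
    while pending:
        progressed = False
        for p, string in enumerate(individual):
            if next_idx[p] == len(string):
                continue
            t = string[next_idx[p]]
            srcs = senders.get(t, [])
            if any(s not in completion for s in srcs):
                continue                # a sender is not scheduled yet
            arrival = max((completion[s] + delay(t, s) for s in srcs), default=0)
            start[t] = max(ready[t], arrival, free_at[p])
            completion[t] = start[t] + exec_time[t]
            free_at[p] = completion[t]
            next_idx[p] += 1
            pending -= 1
            progressed = True
        if not progressed:
            raise ValueError("tasks wait for each other in a cycle")
    return start, completion
```

With strings [[1, 2], [3]], where T1 sends to T3, T3 sends to T2 and every delay is 1, T3 starts at C1 + 1 and T2 starts at C3 + 1 even though its processor is free earlier.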

Calculating the penalty due to unschedulable tasks: given start and completion
times, to determine Pen(Ti), the penalty of a task, the algorithm computes Ci - Di.

We then calculate Pen(Pj). We make the assumption that if a task Ti misses its
deadline and makes all tasks scheduled immediately after it miss their deadlines
too, it is sufficient to discard Ti to avoid their penalties. For this reason, we do not
consider the sum of the Pen(Ti) but their maximum. We are conscious that this
situation occurs especially when tasks have small times between their completion
times and their deadlines. Hence,

\[ penalty\_sched(A) = \sum_{j=1}^{M} \max\{ C_i - D_i,\ \forall T_i \in nb\_T(P_j) \} \]
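A minimal sketch of this penalty is given below. It clamps the per-processor maximum at zero so that an allocation in which every task meets its deadline gets a penalty of exactly zero; that clamping is an assumption made here for consistency with the definition of a correct allocation, and is not stated explicitly in the formula.

```python
def penalty_sched(individual, completion, deadline):
    """Sum over processors of the largest lateness C_i - D_i (at least 0)."""
    total = 0
    for string in individual:
        lateness = [completion[t] - deadline[t] for t in string]
        total += max([0] + lateness)   # empty or on-time strings contribute 0
    return total
```

For strings [[1, 2], [3]] with completion times {1: 2, 2: 9, 3: 5} and deadlines {1: 4, 2: 6, 3: 5}, only T2 is late (by 3), so the penalty is 3.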
Calculating the schedule length: schedule_length(A) is obtained after all
tasks have been allocated and scheduled on each processor:

schedule_length(A) = max { end(Pj), for j = 1, ..., M }, where end(P) = max { Ci, ∀ Ti
in nb_T(P) }

If one allocation has a smaller schedule length than another, its tasks must be
arranged differently, and not just started earlier. Indeed, the algorithm
schedules tasks at the earliest date they can start.

Individual 1                                   Individual 2

P1  T1 |1  T3 |4  T7 |7  T8 |9  T4 |11         P1  T1 |1  T2 |3  T5 |4  T7 |7  T8 |9

P2  T2 |2  T5 |3  T6 |6                        P2  T3 |3  T6 |7  T4 |9

Fig. 3. A comparison of the schedule lengths of two individuals. The numbers at the
right of the tasks are the tasks' completion times.


5  Design  of  New  Genetic  Operators 


The  main  function  of  the  genetic  operators  is  to  create  new  allocations,  based  on 
the  current  generated  allocations.  A  new  individual  is  created  by  combining  or 
rearranging  the  best  part  of  two  individuals.  This  part  can  be  a  string  or  a  sequence 
of  tasks  in  a  string. 

For real-time task allocation, the genetic operators must enforce the ascending
order of tasks' deadlines within a string, and respect the notions of uniqueness and
completeness. The operators selected for the reproduction are mutation and
crossover. Standard mutation exchanges the values of two genes (positions) in an
individual. When the representation adopted is binary, the mutation inverts the
value of the chosen gene. The crossover selects split points in two individuals and
exchanges the parts at the right of the split points.

[Figure 4: two panels. The panel "Applying Crossover Operator" shows two parent
individuals, their split points and the individuals obtained. The panel "Applying
Mutation Operator" shows an individual before mutation and the individual obtained.]

Fig. 4. Examples of standard mutation and crossover applied to a binary
representation




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


5.1  The  Crossover  Operator 


In our representation, we cannot select split points randomly, since this can lead
to illegal individuals. These points must be selected according to deadlines and
classes. In figure 5 crossover is applied randomly.

Individual I1                            Individual I2

P1  T1 |1  T2 |3 ▼ T3 |6  T4 |8          P1  T3 |3  T5 |4 ▼ T6 |8  T4 |10

P2  T5 |2  T6 |6  T7 |9  T8 |11          P2  T1 |1  T2 |3  T7 |6  T8 |8

▼ crossover points

Individual generated I1'                 Individual generated I2'

P1  T1 | T2 | T6 | T4 |                  P1  T3 | T5 | T3 | T4 |

P2  T5 | T6 | T7 | T8 |                  P2  T1 | T2 | T7 | T8 |

T3 is duplicated in I2' and absent in I1'.

Fig. 5. Crossover creating illegal individuals.


Undoubtedly, the legality of newly created individuals is related to the selection of
the crossover points. If the tasks at the right and the left of each crossover point are of
different classes, we guarantee that the generated individuals will not have tasks of
the same class on both sides of the crossover point, and no duplication can occur.
A similar condition was applied to tasks with precedence constraints in [14].


Theorem 1: Let Stringk be the string corresponding to the k-th processor on two
individuals I1 and I2. If the crossover points, between tasks Ti and Tj on I1 and between
tasks Ti' and Tj' on I2, satisfy the following conditions, the created individuals will
be legal:

(1) cl(Ti) < cl(Tj), (2) cl(Ti') < cl(Tj'), (3) cl(Ti) = cl(Ti')

Stringk on I1: ... | Ti | Tj | ...        Stringk on I2: ... | Ti' | Tj' | ...

The two inequalities (1) and (2) enforce the notion of uniqueness. Since the task on
the right of the crossover point in I2 (Tj') is different from the one at the left on I1
(Ti) because they are of different classes, we ensure that neither Ti nor Tj' will be
duplicated and that tasks in Stringk will respect the ascending order of the classes.
The equality (3) ensures the completeness of the individuals. Indeed, if the classes of
Ti and Ti' could differ, a task Tq in I2 that satisfies class(Ti) < class(Tq) < class(Tj')
would not be represented in the schedule.

To apply the crossover, we need to determine a crossover point that satisfies
Theorem 1 on each string of the individuals. As the position of the crossover points
can be anywhere on the string (not necessarily at the same position as in the first
individual), the individual can have strings with various numbers of tasks and with
different schedule lengths. Figure 6 illustrates a correct application of the crossover.


Individual I1                            Individual I2

P1  T1 |1  T2 |3  T3 |6 ▼ T4 |8          P1  T3 |3  T5 |4 ▼ T6 |8  T4 |10

P2  T5 |2 ▲ T6 |6  T7 |9  T8 |11         P2  T1 |1  T2 |3 ▲ T7 |6  T8 |8

▼ crossover points on the first string
▲ crossover points on the second string

Individual obtained I1'                  Individual obtained I2'

P1  T1 |1  T2 |3  T3 |6  T6 |10  T4 |12  P1  T3 |3  T5 |4  T4 |6

P2  T5 |2  T7 |5  T8 |7                  P2  T1 |1  T2 |3  T6 |7  T7 |10  T8 |12

Fig. 6. A correct application of the crossover
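A class-aware crossover in the spirit of Theorem 1 can be sketched as follows. This is a simplified variant and an assumption, not the authors' operator: a single class threshold `c` is shared by all strings, each string is cut just after its last task of class at most `c`, and the tails are swapped. The names `split_index` and `class_crossover` are hypothetical; strings are assumed to be sorted by ascending class.

```python
def split_index(string, cls, c):
    """Index of the first task whose class exceeds c (the crossover point)."""
    for idx, task in enumerate(string):
        if cls[task] > c:
            return idx
    return len(string)


def class_crossover(parent1, parent2, cls, c):
    """Swap, string by string, the tasks of class greater than c."""
    child1, child2 = [], []
    for s1, s2 in zip(parent1, parent2):
        i1, i2 = split_index(s1, cls, c), split_index(s2, cls, c)
        child1.append(s1[:i1] + s2[i2:])
        child2.append(s2[:i2] + s1[i1:])
    return child1, child2


# Classes of Fig. 2 and parents of Fig. 6.
cls = {1: 0, 2: 1, 3: 1, 5: 1, 6: 2, 7: 2, 8: 3, 4: 3}
i1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
i2 = [[3, 5, 6, 4], [1, 2, 7, 8]]
child1, child2 = class_crossover(i1, i2, cls, c=1)
```

Each child takes all tasks of class at most c from one parent and all tasks of higher class from the other, so every task appears exactly once. With c = 1, the two children are the individuals I1' and I2' of Fig. 6.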


5.2  The  Mutation  Operator 

The standard mutation operator usually applies to two genes in an individual.
Our proposed mutation operator is applied to two tasks belonging to different
strings in a given individual. We need not apply mutation within the same string, since
the scheduling algorithm already determines the earliest completion time on the processor.

The proposed adaptation of the mutation operator is based on the reduction of the
number of unschedulable tasks. Applying the standard mutation may lead to
interesting results: moving a randomly picked task can reduce the schedule
length on its processor, and the task may still be guaranteed in the schedule of another
processor. In this way, however, we cannot preserve the correct parts of a schedule
(where all tasks meet their deadlines); we just rely on random behaviour.

Thus, the GA needs to know, on each processor Pj, the number of unschedulable
tasks nb_pen(Pj) and the first task missing its deadline T_pen(Pj). We first search
for the processor that has the highest penalty, Php, and the one having the smallest
penalty, Plp. The first task missing its deadline is transferred from Php to Plp, or
exchanged with a task having a higher deadline, depending on the value of a
parameter min_pen (the minimum penalty to proceed to a transfer). When an exchange is
done, we forbid the case where the task is just exchanged with a task as penalizing
as the first.

A variant of this algorithm is to take the most penalizing task instead of the first
task missing its deadline. More details are presented in [1].

Let us take the individuals obtained in figure 6. The created individuals I1' and
I2' have higher penalties than their parents. On I1', there are 3 penalties for tasks T3,
T6 and T4. T_pen(P1), which is T3, is transferred to P2. On the second individual,
nb_pen(P2) is 2 with tasks T7 and T8. nb_pen(P2) is not greater than min_pen, so the
first penalizing task T7 is exchanged with T4, which has a higher deadline. Both
obtained allocations are correct and I1'' has the optimal schedule_length.


I1''                                     I2''

P1  T1 |1  T2 |3  T6 |7  T4 |9           P1  T3 |3  T5 |4  T7 |7

P2  T3 |3  T5 |4  T7 |7  T8 |9           P2  T1 |1  T2 |3  T6 |7  T8 |9  T4 |11


Fig. 7. Examples of mutation applied after crossover
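The transfer/exchange rule of the mutation operator can be sketched as below. Several points are assumptions made for the sake of a runnable example: the "highest penalty" processor is taken to be the one with the most late tasks, the exchange partner is the first eligible task on the target processor, moved tasks are re-inserted in deadline order, and the deadlines used in the example are invented (consistent with the classes of Fig. 2). The function names are hypothetical.

```python
import bisect


def insert_by_deadline(string, task, deadline):
    """Insert task so that the string stays sorted by deadline."""
    keys = [deadline[t] for t in string]
    string.insert(bisect.bisect_right(keys, deadline[task]), task)


def mutate(individual, late, deadline, min_pen):
    """late[p] lists the tasks missing their deadline on processor p,
    in schedule order. Returns a new individual."""
    child = [list(s) for s in individual]
    procs = range(len(child))
    hp = max(procs, key=lambda p: len(late[p]))   # most penalized processor
    lp = min(procs, key=lambda p: len(late[p]))   # least penalized processor
    if not late[hp] or hp == lp:
        return child
    task = late[hp][0]                            # first task missing its deadline
    if len(late[hp]) > min_pen:                   # transfer
        child[hp].remove(task)
        insert_by_deadline(child[lp], task, deadline)
        return child
    # Exchange with a later-deadline task that is not itself penalizing.
    partners = [t for t in child[lp]
                if deadline[t] > deadline[task] and t not in late[lp]]
    if partners:
        other = partners[0]
        child[hp].remove(task)
        child[lp].remove(other)
        insert_by_deadline(child[lp], task, deadline)
        insert_by_deadline(child[hp], other, deadline)
    return child


# Hypothetical deadlines; individuals I1' and I2' of Fig. 6.
deadline = {1: 3, 2: 4, 3: 5, 5: 6, 6: 7, 7: 9, 8: 10, 4: 12}
m1 = mutate([[1, 2, 3, 6, 4], [5, 7, 8]], [[3, 6, 4], []], deadline, min_pen=2)
m2 = mutate([[3, 5, 4], [1, 2, 6, 7, 8]], [[], [7, 8]], deadline, min_pen=2)
```

Under these assumptions, `m1` and `m2` have the same task orders as I1'' and I2'' in Fig. 7: T3 is transferred to P2 in the first case, and T7 is exchanged with T4 in the second.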


5.3  Parallel  Execution  of  the  Genetic  Algorithm 

The  excessive  cost  of  GA  is  the  main  handicap  for  their  implementation  on 
distributed  or  parallel  systems.  In  order  to  take  advantage  of  the  benefits  of  parallel 
systems,  three  parallel  models  for  the  execution  of  the  GA  were  designed: 

(1) First, we have the centralized model, where a master processor generates the
initial population and distributes the individuals on a farm of processors. Each
processor (slave) executes the reproduction and selection steps and sends the best
individual to the master processor, which then replaces the bad
individuals. One advantage is that at any moment, the best individual of a
population can be known and even put in the next population. However,
communication costs, which are the main handicap of these systems, grow
exponentially with the population size.

(2) In the second model, the population is divided into equal-size subpopulations of
individuals, each subpopulation being placed on a processor. Individuals are
reproduced within a processor and exchanged between processors. This approach is
interesting when the population size is greater than the number of processors. However,
the parallelism internal to a subpopulation is not exploited.

(3)  In  the  parallel  model,  each  individual  is  placed  on  a  processor.  Hence,  all 
phases  from  selection  to  replacement  are  done  in  parallel.  The  processors  exchange 
their  individuals  with  their  physical  neighbours.  This  choice  reduces  communication 
costs  and  fully  uses  the  parallelism  of  the  GA  steps.  For  these  reasons,  we  adopt  it. 

Algorithm Parallel-Genetic-Algorithm

PGA1. [Initialize] generate initial population and
place one individual on each processor

PGA2. [Compute penalty_f function] schedule tasks
within the local individual and calculate the
scheduling and replication penalties

PGA3. [Repeat until convergence] while a maximum number
of iterations is not reached do steps PGA4 to PGA7 on
each processor

PGA4. [Communication step] send the local individual to
neighbours and wait for theirs.

PGA5. [Selection step] select the individual among the
4 received ones that has the lowest value of penalty_f

PGA6. [Reproduction step]

-  apply crossover (to the local individual and the
selected one),

-  schedule tasks within each individual and
evaluate the penalties due to unschedulable tasks,

-  apply mutation to the individuals obtained,

-  reschedule tasks and evaluate the whole cost
function in case of penalty_f equal to zero.

PGA7. [Replacement step] the local individual is
replaced if one of the generated individuals has
a lower cost function.
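One generation of steps PGA4 to PGA7 can be emulated sequentially as sketched below. The processor topology is not specified in the text; a 2D torus is assumed here only because it gives each processor exactly 4 neighbours. The callables `penalty_f`, `reproduce` and `cost` are hypothetical placeholders for the penalty function, the crossover-plus-mutation step and the full cost function.

```python
def torus_neighbours(p, rows, cols):
    """Indices of the 4 neighbours (up, down, left, right) on a 2D torus."""
    r, c = divmod(p, cols)
    return [((r - 1) % rows) * cols + c,
            ((r + 1) % rows) * cols + c,
            r * cols + (c - 1) % cols,
            r * cols + (c + 1) % cols]


def pga_step(population, rows, cols, penalty_f, reproduce, cost):
    """Sequential emulation of one generation, one individual per processor."""
    new_population = []
    for p, local in enumerate(population):
        # PGA4: receive the individuals of the 4 neighbours.
        received = [population[n] for n in torus_neighbours(p, rows, cols)]
        # PGA5: keep the received individual with the lowest penalty_f.
        mate = min(received, key=penalty_f)
        # PGA6: crossover and mutation are delegated to reproduce().
        best = min(reproduce(local, mate), key=cost)
        # PGA7: replace the local individual only if the cost improves.
        new_population.append(best if cost(best) < cost(local) else local)
    return new_population
```

Because every processor reads only the previous generation of its neighbours, the loop body could run on all processors at once, which is the property the fully parallel model relies on.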


6  Performance  Evaluation 

The PGA was implemented on the Supernode, a 120-transputer machine
with no shared memory. The evaluation presented was done with task sets of eight and
thirty tasks, to show that the algorithm also gives good results with few tasks. Four
benchmarks are presented:

•  B1  :  a  graph  composed  of  8  tasks  with  no  replication 

•  B2  :  the  same  graph  with  3  tasks  replicated 

•  B3  :  a  graph  of  30  tasks  with  no  replication 

•  B4  :  the  same  graph  with  5  tasks  replicated 

For  each  benchmark,  the  PGA  was  run  20  times.  NI  is  the  number  of  iterations 
necessary  to  obtain  a  correct  allocation  and  ET  is  the  execution  time  given  in 
seconds.  AV  is  the  average.  Table  1  shows  that  the  algorithm  rapidly  converges 
when  no  task  is  replicated.  It  is  obvious  that  searching  for  the  first  penalizing  task  is 
less  expensive  than  searching  for  the  replicated  tasks. 


Table 1. Performance evaluation of the PGA

        NI                      ET (s)
        Min    Max    AV        Min      Max      AV
B1      1      2      1         0.05     0.08     0.075
B2      1      20     9         0.060    0.090    0.085
B3      2      33     13        0.8      3.025    1.25
B4      10     100    40        0.7      14.75    6.035

The genetic operators are usually applied with a probability. The crossover
operator we developed is applied only when split points can be found, so no
probability can affect the algorithm convergence with crossover. However, we
conceived the mutation operator to reduce the scheduling penalty, and we observe
that, when it is applied, the scheduling penalty is considerably reduced from the
first iterations.


Figure  8  depicts  the  observed  number  of  iterations  with  different  mutation 
probabilities.  A  reasonable  value  for  the  mutation  probability  is  0.9. 

Fig.  8.  Observed  numbers  of  iterations  with  different  mutation  probabilities 


6.1  Comparison  With  the  Simulated  Annealing  Algorithm 

We applied the simulated annealing algorithm (SAA) to the same sets of tasks, in
order to show how fast our algorithm is. The SAA has been applied to solve the static
allocation of real-time tasks in [21], [2] and [6].

The SAA uses a population of different energy states, each of them corresponding
to an allocation. Each state is associated with a temperature, which is reduced at each
iteration in order to reach the lowest energy point, which represents the best
allocation. Neighbouring allocations are obtained by choosing a task and moving it
to a randomly selected processor. Unlike the PGA, the SAA cannot help in building
correct allocations; it can only evaluate a solution using a cost function.

We had to develop a sequential version of the algorithm because the SAA cannot
be parallelized easily. The SAA was applied with an initial temperature of 10 and a
minimal one of 0.1.
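A sequential simulated annealing loop with the neighbourhood move described above can be sketched as follows. The cooling schedule is not given in the text: a geometric factor `alpha` is assumed here, as is the standard Metropolis acceptance rule. The function name and the `cost` callable are hypothetical, and at least one task is assumed to be allocated.

```python
import math
import random


def simulated_annealing(initial, cost, t_init=10.0, t_min=0.1, alpha=0.9, rng=None):
    """Return (best allocation found, its cost)."""
    rng = rng or random.Random()
    current = [list(s) for s in initial]
    cur_cost = cost(current)
    best, best_cost = current, cur_cost
    t = t_init
    while t > t_min:
        # Neighbour: pick a task and move it to a randomly selected processor.
        neighbour = [list(s) for s in current]
        src = rng.choice([p for p, s in enumerate(neighbour) if s])
        task = neighbour[src].pop(rng.randrange(len(neighbour[src])))
        neighbour[rng.randrange(len(neighbour))].append(task)
        new_cost = cost(neighbour)
        delta = new_cost - cur_cost
        # Always accept improvements; accept degradations with prob. exp(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, cur_cost = neighbour, new_cost
            if cur_cost < best_cost:
                best, best_cost = current, cur_cost
        t *= alpha   # assumed geometric cooling from t_init down to t_min
    return best, best_cost
```

Unlike the class-aware genetic operators, this move ignores deadlines, so any deadline ordering has to be restored by the scheduler that computes the cost.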

Abstract. The growing market of embedded systems and applications has led to
the  making  of  more  general  embedded  processors,  with  some  features 
traditionally  associated  with  general-purpose  microprocessors.  Following  this 
trend,  recent  research  has  tried  to  incorporate  into  embedded  processors  the 
newest  techniques  to  break  down  ILP  limits.  Value  speculation  is  a  recent 
technique  not  yet  considered  in  the  context  of  embedded  processors,  and  the 
goal  of  the  present  work  is  to  analyse  the  performance  potential  of  this 
technique  within  this  scope. 


1  Introduction 

Over  the  last  few  years,  the  increasing  number  of  communication  and  multimedia 
applications  has  brought  about  a  growing  demand  for  high  performance  in  embedded 
computing  systems  [1],  [2],  and  many  of  the  techniques  for  extracting  Instruction- 
Level  Parallelism  (ILP),  traditionally  used  in  high  performance  general-purpose 
systems,  are  being  applied  to  embedded  processors  [3].  The  limits  on  the  amount  of 
extractable  ILP  are  due  to  the  program  dependencies,  and  data  dependencies  present  a 
particularly  major  hurdle.  Through  value  speculation,  it  is  possible  to  counteract  data 
dependencies  and  thus  increase  the  program’s  degree  of  parallelism. 

The value prediction technique, like branch prediction, allows a temporary violation
of the program constraints without affecting its semantics. Based on the previous
history  of  program  execution,  the  hardware  predicts  at  run-time  the  outcome  of  an 
instruction,  which  is  used  by  the  consumer  instructions  when  the  real  data  is  not  yet 
ready.  When  the  true  data  becomes  available,  it  is  compared  with  the  predicted  value, 
and  in  the  case  of  a  mismatch,  the  instructions  are  re-executed  with  the  correct  value. 

In  the  context  of  general-purpose  microprocessors,  the  performance  potential  of 
this  relatively  recent  technique  has  been  shown  to  be  significant  in  a  number  of 
studies  [4] [5].  Our  intuition  is  that  multimedia  and  communication  programs  present  a 
more  highly  predictable  (value)  behavior  than  normal  programs,  due  to  the  nature  of 
both  the  algorithms  and  the  input  data.  The  objective  of  this  work  is  the  application  of 
value  prediction  techniques  in  the  ambit  of  embedded  processors  and  the 
demonstration  of  a  better  efficiency  within  this  scope. 

To  achieve  this  comparative  analysis  we  have  collected  results  for  the  integer 
SPEC95  and  MediaBench  [6]  benchmarks.  We  used  integer  SPEC'95  as  an  evaluation 
benchmark  in  the  context  of  general-purpose  systems  and  MediaBench  (composed  of 
applications  culled  from  image  processing,  communications  and  DSP  applications)  as 
a representative benchmark set for embedded computing systems. First, we perform a
predictability analysis, and we show that the output values of the MediaBench
programs are, on average, more predictable than those of the SPEC95 programs, using
several low-cost configurations of different predictor models. However, predictability
results are not enough to justify the use of extra hardware to predict values; it is
essential to show that processor performance is also improved. So, in addition, we perform
detailed timing simulations in order to compare the speedup achievable by using
value prediction in two typical architectures (a high-performance embedded
processor architecture and a high-performance general-purpose processor architecture),
and we show that, using a low-cost value predictor, an embedded processor
running the MediaBench programs can profit much more from value prediction than a
general-purpose processor running the SPEC programs.

The  paper  is  organised  as  follows.  Section  2  summarises  the  previous  work  on  data 
value  prediction.  Section  3  describes  the  experimental  framework.  Section  4  presents 
a  comparative  analysis  of  value  predictability  for  different  predictor  models.  Section  5 
describes  the  two  machine  models  used  in  the  timing  simulations  and  the  speedup 
results.  Finally,  section  6  presents  the  conclusions  and  future  work. 


2  Related  Work 

Early  work  on  value  prediction  [7]  showed  that  instructions  exhibit  a  new  kind  of 
locality,  called  value  locality,  which  means  that  the  values  generated  by  a  given  static 
instruction  tend  to  be  repeated  for  a  large  fraction  of  the  execution  time.  This  property 
allows  the  data  to  be  predictable.  In  a  later  work,  Sazeides  et  al.  [4]  state  that  the 
predictability  of  a  value  sequence  is  a  function  of  the  sequence  itself  and  the  predictor 
used.  In  this  way,  we  can  find  some  kinds  of  predictable  sequences,  like  for  example 
the  stride  sequences,  that  do  not  exhibit  value  locality. 

Most of the value predictors proposed in the literature fit into one of the following
types: Last-value predictors (LVP), which make a prediction based on the last
outcome of the same static instruction, and can correctly predict constant sequences of
data [7], [8]. Stride predictors (SP), which make a prediction based on the last
outcome plus a constant stride, and can correctly predict arithmetic sequences of data
(including constant sequences, whose stride is 0) [8], [9]. Context-based predictors
(CBP), which learn the values that follow a particular context and make a prediction
based on the last values generated by the same instruction. They can correctly predict
repetitive sequences of data [4], [8]. Hybrid predictors (HP), which combine some of
the previous predictors and include a selection mechanism, which is either hardware
[8], [10], [11], or software [12]. To date, most of the implementations of these
predictors have been simulated in the context of general-purpose superscalar
processors using SPEC95 as the evaluation benchmark suite. The results obtained
are very promising: on average we can correctly predict about 50% of the output
values of a program and obtain about a 20% improvement in speedup [10], [11], [4].
But to obtain these results, sophisticated and expensive predictors are needed, which
are difficult to implement with current technology.
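The behaviour of the two simplest predictor types can be illustrated with a short software model. This is a deliberately simplified sketch, not a model of any hardware proposed in the cited works: the tables are unbounded dictionaries indexed by instruction address, with no tags, no confidence counters and no aliasing, and the class and function names are hypothetical.

```python
class LastValuePredictor:
    """Predicts that an instruction produces the same value as last time."""

    def __init__(self):
        self.table = {}                       # pc -> last value

    def predict(self, pc):
        return self.table.get(pc)             # None if the pc was never seen

    def update(self, pc, value):
        self.table[pc] = value


class StridePredictor:
    """Predicts the last value plus the last observed stride."""

    def __init__(self):
        self.table = {}                       # pc -> (last value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        last, _ = self.table.get(pc, (value, 0))
        self.table[pc] = (value, value - last)


def accuracy(predictor, trace):
    """Fraction of (pc, value) pairs in the trace predicted correctly."""
    hits = 0
    for pc, value in trace:
        if predictor.predict(pc) == value:
            hits += 1
        predictor.update(pc, value)
    return hits / len(trace)


arithmetic = [(0x40, v) for v in range(0, 30, 3)]   # 0, 3, 6, ..., 27
constant = [(0x40, 7)] * 10
```

On the arithmetic trace the stride predictor is correct from the third value onwards while the last-value predictor never is; on the constant trace both miss only the first value, since a constant sequence is a stride-0 sequence.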

In  the  context  of  embedded  processors,  we  can  find  several  studies  which  try  to 
improve  performance  by  applying  techniques  traditionally  used  in  the  ambit  of 
general-purpose  processors.  However,  value  prediction  is  a  recent  technique  not  yet 
applied to this kind of processor. The reason for this lies in area restriction, a major
challenge in embedded systems, which makes the inclusion of very expensive
value-prediction hardware in the processor unfeasible. Nevertheless, this is not the
case here, since a very small predictor table is enough for this particular kind of
application, as we will show later.


3  Experimental  Framework 

This  section  describes  the  framework  employed  in  our  research  to  obtain  the 
experimental  results.  We  performed  our  experiments  on  simulators  derived  from  the 
SimpleScalar  3.0  toolset  (PISA  version)  [13],  a  suite  of  functional  and  timing 
simulation  tools. 

As  we  mentioned  above,  we  collected  results  from  the  integer  SPEC95  and 
MediaBench  (MB)  [6]  benchmarks,  whose  characteristics  are  shown  in  Tables  1  and  2 
respectively. 


Table 1. SPEC95 integer benchmark statistics.

BENCH.       DESCRIPTION        INPUT SET                    # INST.   %LOAD   %INT
Compress95   Data compression   30000 e 2231                 95 M      21.35   46.03
Cc1          Compiler           Ref. Input (gcc.i)           203 M     26.05   39.95
Go           Game               99                           132 M     20.66   57.16
Ijpeg        Jpeg encoder       Train Input (specmun.ppm)    553 M     17.63   65.21
M88ksim      M88000 Simulator   Train Input                  120 M     18.98   49.82
Perl         PERL interpreter   Train Input (scrabbl.in)     40 M      27.83   34.97
Li           LISP emulator      Train Input                  183 M     25.90   34.74
Vortex       Data base          Train Input                            30.67   30.82


Table 2. MediaBench suite characteristics.

BENCH.        DESCRIPTION                   INPUT SET         # INST.   %LOAD   %INT
Jpeg          JPEG image comp / decomp      Testimg.ppm       20 M      22.73   55.75
Mpeg          MPEG-2 video encod / decod    Rec’.YUV          1300 M    25.41   51.69
Gsm           GSM speech encod / decod      Clinton.pcm       306 M     14.88   72.47
G.721         Voice comp / decomp           Clinton.pcm       546 M     13.50   59.13
Pegwit        Public key encr / decr        Pgptest.plain     50 M      20.98   61.28
Pgp           Public key encr / decr        Pgptest.plain     153 M     17.31   67.57
Ghostscript   PostScript interpreter        Tiger.ps          1300 M    14.31   56.21
Mesa          3-D graphics library          N/A               8 M       23.22   46.10
Rasta         Speech recognition            Ex5_c1.wav        39 M      21.60   45.14
Epic          Image comp / decomp           Test_image.pgm    59 M      12.87   53.87
Adpcm         Audio encod / decod           Clinton.pcm       12 M      6.79    62.99


The ijpeg program belongs to both benchmark suites but, despite the shared
name, the two versions are quite different: not only are the library versions
different, but so are the ways in which they are used. The input files and the
program parameters of the test programs are different as well. The SPEC95
version of JPEG was modified because the cjpeg and djpeg routines, for
compression and decompression, required more I/O traffic than is acceptable
under the SPEC CPU guidelines; this was overcome by reading the image into a
memory buffer and processing it repeatedly with different compression settings.

The majority of the MediaBench programs are composed of two applications:
compression/decompression or coding/decoding. We have combined the results for
the two applications by first executing the compression or coding program and
then the decompression or decoding program, and putting the data obtained
together. The programs were compiled with the gcc compiler included in the tool
set, using optimization level -O3. Due to time constraints, we have simulated
only 100 million instructions.


4  Predictability  analysis 


In this section we analyze and compare value predictability for the MediaBench
and SPEC95 benchmark suites. This analysis is based on the percentage of program
values that can be correctly predicted. Our main purpose is to demonstrate that
typical embedded applications exhibit a more predictable value behavior than
normal applications, especially for low-cost predictors.

As  mentioned  before,  the  predictability  of  a  value  sequence  is  a  function  of  both 
the  sequence  itself  and  the  predictor  employed.  Therefore,  in  order  to  accurately 
compare  several  program  sets,  it  is  necessary  to  carry  out  experiments  for  all  the 
existing  predictor  models.  Furthermore,  we  must  consider  that  using  idealized 
predictors  (infinite  tables)  it  is  possible  to  evaluate  the  theoretical  value  predictability 
of  programs  [4],  although  this  is  not  our  goal.  On  the  contrary,  we  want  to  empirically 
assess  the  program  predictability  by  using  realistic  and  low-cost  implementations  of 
the  predictor  models  (limited  table  size).  From  this  pragmatic  analysis  we  should  be 
in  a  position  to  foresee  some  of  the  performance  results  presented  later,  and  we  should 
also  be  able  to  select  the  most  suitable  value  predictor  for  embedded  processors. 


4.1  Predictor  models 

We  should  first  introduce  the  particular  low-cost  implementations  of  the  predictor 
models  which  are  employed  in  this  work.  In  view  of  the  fact  that  the  last  value 
predictions  are  special  stride  predictions  (with  zero  stride),  only  stride,  context-based 
and  hybrid  prediction  schemes  are  considered.  An  initial  analysis  of  each  benchmark 
value  behavior  is  also  presented  below. 

Stride  predictor  implementation.  The  SP  is  implemented  by  means  of  a  direct 
mapped  table.  The  table  is  indexed  using  the  least  significant  bits  of  the  instruction 
PC.  Each  table  entry  stores  the  following  information:  the  last-value  produced  by  the 
instruction  (32  bits),  the  stride  between  the  two  last  outputs  of  the  instruction  (8  bits), 
and  the  confidence  bits.  The  percentages  of  values  correctly  predicted  (also  called 
predictor  efficacy)  for  both  program  suites,  MediaBench  (MB)  and  SPEC95,  are 
shown in figure 1.
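The stride predictor described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the simulator code used for the experiments: the table starts zeroed, the stride is replaced on every update, a stride that does not fit in the 8-bit field is stored as zero, the confidence bits are left out, and the `pc_shift` parameter (for dropping alignment bits of the PC) is hypothetical.

```python
class StridePredictor:
    """Direct-mapped stride predictor (confidence bits omitted in this sketch)."""

    def __init__(self, entries=256, stride_bits=8, pc_shift=0):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.pc_shift = pc_shift  # hypothetical: number of alignment bits to drop
        self.stride_min = -(1 << (stride_bits - 1))
        self.stride_max = (1 << (stride_bits - 1)) - 1
        # Each entry holds [last_value, stride].
        self.table = [[0, 0] for _ in range(entries)]

    def _index(self, pc):
        # Index with the least significant bits of the instruction PC.
        return (pc >> self.pc_shift) & self.mask

    def predict(self, pc):
        last, stride = self.table[self._index(pc)]
        return (last + stride) & 0xFFFFFFFF

    def update(self, pc, value):
        entry = self.table[self._index(pc)]
        stride = value - entry[0]
        # Assumption: a stride that does not fit in the stride field is stored as 0.
        if not (self.stride_min <= stride <= self.stride_max):
            stride = 0
        entry[0], entry[1] = value & 0xFFFFFFFF, stride


def efficacy(predictor, trace):
    """Fraction of (pc, value) pairs in `trace` that are predicted correctly."""
    hits = 0
    for pc, value in trace:
        hits += predictor.predict(pc) == value
        predictor.update(pc, value)
    return hits / len(trace)
```

For a single instruction producing 100, 104, 108, ..., the first two predictions miss while the entry is being trained, and every later value is predicted correctly.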

Looking at the results in figure 1, the first remark to be made is that, apart
from gsm and pegwit, a considerably high percentage of the MB program values
can be correctly predicted by the SP (40%-50%), and very small tables are
enough to achieve these results. Furthermore, except for three of the eleven
programs that make up the MB suite, almost the same percentage of correct
values is obtained with a 256-entry table as with a 4K-entry table. On the
other hand, the results for the SPEC95 benchmarks show an appreciably different
behavior. For most of the programs the predictor table size has a significant
influence on efficacy, and the results are not particularly outstanding.
Nevertheless, the m88ksim program exhibits a particularly high value
predictability, thus appreciably raising the average results.


Fig 1. SP efficacy for 256, 512, 1K, 2K and 4K-entry tables: a) MediaBench,
b) SPEC95.


Context-based predictor implementation. The CBP is derived from the work of
Sazeides et al. [14] and uses a 2-level table. The first-level table, called
the Value History Table (VHT), is direct mapped and is indexed using the least
significant bits of the instruction PC. This table stores an order-3 context
composed of the last value produced by the instruction (32 bits) and the two
strides between the last 3 outcomes produced by the instruction (8 bits each).
The second-level table, called the Value Prediction Table (VPT), is indexed by
a hash function, which uses context information from the VHT. The VPT is
responsible for storing the value prediction (32 bits) and the confidence
estimation for each context. The shift-xor-fold hash function (also used for
indexing the 2nd-level table in the hybrid predictor), shown in figure 2,
differs from the original one proposed by Sazeides, and significantly reduces
the aliasing in the VPT (especially for small tables).


Fig 2. CBP hash function (inputs: the order-3 context, i.e. the two strides and
the last value; final stage: fold).
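The two-level organisation can be illustrated with a short sketch. The hash below is a hypothetical shift-xor-fold written only for illustration; it is not the function of figure 2, whose exact shifts are not given in the text. The update policy (always overwrite the VPT entry) and the omission of confidence estimation are also assumptions.

```python
MASK32 = 0xFFFFFFFF


def shift_xor_fold(last_value, stride1, stride2, index_bits):
    """Hypothetical shift-xor-fold hash of an order-3 context (illustrative only)."""
    # Shift each 8-bit stride by a different amount and XOR it into the last value.
    mixed = (last_value ^ ((stride1 & 0xFF) << 8) ^ ((stride2 & 0xFF) << 16)) & MASK32
    # Fold: XOR successive index_bits-wide slices down to one table index.
    index = 0
    while mixed:
        index ^= mixed & ((1 << index_bits) - 1)
        mixed >>= index_bits
    return index


class ContextPredictor:
    """Two-level context-based predictor: VHT holds contexts, VPT holds predictions."""

    def __init__(self, vht_entries=128, vpt_entries=256):
        self.vht_mask = vht_entries - 1
        self.vpt_bits = vpt_entries.bit_length() - 1
        self.vht = [(0, 0, 0) for _ in range(vht_entries)]  # (last, stride1, stride2)
        self.vpt = [0] * vpt_entries

    def _vpt_index(self, pc):
        return shift_xor_fold(*self.vht[pc & self.vht_mask], self.vpt_bits)

    def predict(self, pc):
        return self.vpt[self._vpt_index(pc)]

    def update(self, pc, value):
        # Train the VPT entry selected by the old context, then advance the context.
        self.vpt[self._vpt_index(pc)] = value & MASK32
        last, s1, _ = self.vht[pc & self.vht_mask]
        self.vht[pc & self.vht_mask] = (value & MASK32, (value - last) & 0xFF, s1)
```

With this sketch, a repeating non-stride sequence such as 3, 7, 1, 3, 7, 1, ... is predicted correctly once each of its three contexts has been seen, which a stride predictor cannot achieve.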

Figure 3 presents the efficacy results for several different CBP configurations
(described in table 3). In contrast to the SP, table size now has a significant
influence on predictor efficacy for both sets of benchmarks: increasing the
size of the prediction table from 256 up to 4K entries doubles the CBP efficacy
for most of the programs. It is also important to highlight that, although for
the SPEC95 suite the results of the CBP seem slightly worse than those of the
SP, for the MB set we observe a significant improvement in value
predictability, especially for gsm, which now exhibits good predictability.


Fig 3. CBP efficacy for 256, 512, 1K, 2K and 4K-entry VPTs: a) MediaBench,
b) SPEC95.


Hybrid predictor implementation. The traditional approach to implementing
hybrid predictors is based on dissociated predictors and a selection mechanism.
Each individual predictor produces its own prediction, and the selection logic
is responsible for choosing the most suitable one for the current instruction.
However, when the sets of instructions predictable by each predictor overlap
heavily, the hardware efficiency of this approach is low, because it uses
duplicated hardware to predict the same instructions. The hybrid predictor
employed in this paper is based on previous work presented in [11]. Instead of
using dissociated predictor schemes, it uses overlapped ones and a finite state
machine based on value sequence classification, which decides when it is
necessary to use each part of the predictor. The key idea behind this approach
is to use the extra hardware only when it is necessary for predicting a
particular value sequence. In this way, for constant sequences it only uses the
last-value table, for stride sequences (not constant) it uses both the
last-value and stride tables, and for non-stride sequences it additionally uses
the second-level table. Notice that this hybrid predictor produces only one
prediction at a time. The block and state diagrams of this predictor are shown
in figure 4; for more details please see [11].

Several different configurations are possible for this kind of predictor, since
each of the tables can be of a different size. In this work we have selected HP
configurations with the same cost as the CBP. The configurations employed are
described in table 3 and the HP efficacy results are shown in figure 5.


Table 3. Predictor configuration.

                             Configuration
Predictor  Entries           A     B     C     D     E
Stride     E                 256   512   1024  2048  4096
Context    Evht              128   256   512   1024  1024
           Evpt              256   512   1024  2048  4096
Hybrid     Elast = Estride   128   256   512   1024  1024
           Evpt              256   512   1024  2048  4096


Fig 4. Hybrid predictor.


Fig 5. Hybrid predictor efficacy for 256, 512, 1K, 2K and 4K-entry VPTs:
a) MediaBench, b) SPEC95.


From the results presented above we can observe that, in general, program
predictability is higher for the hybrid predictor than for the other predictors.
Nevertheless,
variations  can  be  observed  depending  on  the  suite  under  consideration.  For  the  MB 
suite,  a  remarkable  increase  in  predictability  can  be  noticed  for  all  the  programs, 
while  for  the  SPEC95  set  the  previous  remark  is  only  true  for  a  few  benchmarks.  In 
all  other  aspects,  the  behavior  of  the  HP  is  similar  to  that  of  the  previous  predictors. 
With  respect  to  the  predictability  of  pegwit,  although  significantly  better  than  for  the 
SP  or  CBP,  we  observe  once  more  that  it  is  particularly  poor  compared  to  the  other 
programs  of  the  MB  set  (this  is  not  true  if  compared  to  SPEC95  programs).  The 
reason  lies  in  the  nature  of  the  program  itself.  Pegwit  is  a  program  for  public  key 
encryption  and  its  structure  has  been  chosen  specifically  to  avoid  redundancy  and  so 
be  resistant  to  cryptanalysis  methods  [15]. 

4.2  Comparative  results 

To highlight the differences between both benchmark suites, we compare the
average efficacy results as a function of the predictor cost. Furthermore, this
comparison also helps us to select the best predictor (i.e. the best balance
between efficacy and cost).

The most complex structures of the predictor are the prediction tables, and
therefore we propose using the global table size as a measure of the predictor
cost. Table 4 describes the formulae used to calculate the overall size of the
predictors (E represents the number of table entries and N represents the
number of bits of an entry field).


Table 4. Cost formulae.

Predictor  Global Table Size
Stride     E * (Nvalue + Nstride)
Context    Evht * (Nvalue + 2 * Nstride) + Evpt * Nvalue
Hybrid     Elast * Nvalue + Evpt * Nvalue + 2 * Estride * Nstride
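As a worked check of these formulae, the sketch below evaluates them for the configurations of Table 3, assuming Nvalue = 32 bits and Nstride = 8 bits (the field widths given in Section 4.1). Confidence bits are not counted, as in Table 4. The function and variable names are illustrative only.

```python
N_VALUE = 32   # bits per stored value (Section 4.1)
N_STRIDE = 8   # bits per stored stride (Section 4.1)


def stride_cost(e):
    """Stride predictor table size in bits."""
    return e * (N_VALUE + N_STRIDE)


def context_cost(e_vht, e_vpt):
    """Context-based predictor table size in bits."""
    return e_vht * (N_VALUE + 2 * N_STRIDE) + e_vpt * N_VALUE


def hybrid_cost(e_last, e_stride, e_vpt):
    """Hybrid predictor table size in bits."""
    return e_last * N_VALUE + e_vpt * N_VALUE + 2 * e_stride * N_STRIDE


def kbytes(bits):
    return bits / 8 / 1024


# Table 3 entry counts per configuration: (E, Evht, Evpt).
# The hybrid rows use Elast = Estride with the same counts as Evht.
CONFIGS = {
    "A": (256, 128, 256),
    "B": (512, 256, 512),
    "C": (1024, 512, 1024),
    "D": (2048, 1024, 2048),
    "E": (4096, 1024, 4096),
}

for name, (e, e_vht, e_vpt) in CONFIGS.items():
    sp = kbytes(stride_cost(e))
    cbp = kbytes(context_cost(e_vht, e_vpt))
    hp = kbytes(hybrid_cost(e_vht, e_vht, e_vpt))
    print(f"{name}: SP {sp:.2f} KB, CBP {cbp:.2f} KB, HP {hp:.2f} KB")
```

Under these assumptions the context-based and hybrid predictors have identical cost in every configuration, and configuration D evaluates to 14 Kbytes for both.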

Figure 6 shows the average results for both sets of programs as a function of
the cost. We have computed two different means in order to evaluate the
uniformity of the program suite behavior: the normal average and the so-called
realistic mean, calculated as the arithmetic mean of all programs except those
with the best and worst behaviors. In general, we observe that in the case of
the MediaBench suite both means are practically equal, but for the SPEC95
benchmarks the average is about 5% above the realistic mean. This indicates a
more homogeneous behavior, in terms of value predictability, in the MediaBench
set than in the SPEC95 set (which is more sensitive to the outstanding behavior
of the m88ksim program).
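The two means can be stated precisely in code. The sketch below is illustrative; the sample numbers in the usage note are made up to show the effect of one outlier and are not measurements from this study.

```python
def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def realistic_mean(values):
    """Arithmetic mean after dropping the single best and single worst value."""
    if len(values) < 3:
        raise ValueError("need at least three values")
    trimmed = sorted(values)[1:-1]
    return sum(trimmed) / len(trimmed)
```

For the hypothetical efficacies 30, 32, 34, 36 and 80, `average` gives 42.4 while `realistic_mean` gives 34.0, so a single outstanding program raises the average well above the realistic mean.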


Fig 6. Comparative results for MB and SPEC benchmarks: a) SP, b) CBP, c) HP.


These results also show that predictability is higher for the MediaBench suite
in all circumstances, and that the difference between both benchmark sets is
more prominent for small predictor costs, decreasing as cost grows. This
comparatively high predictability of the MediaBench programs may be explained
by the following reasons. First, they have, on average, many more integer and
fewer load instructions than the SPEC95 programs (see tables 1 and 2); in fact,
integer instructions are the most predictable instructions, as shown in [4].
Second, MediaBench applications exhibit more loop-intensive structures and more
redundancy in the input data (images, voice, video...) than SPEC95 programs.

Comparing the different predictors in the case of embedded processors, it is
clear that the hybrid predictor exhibits the best balance between efficacy and
cost, and hence it represents the most suitable choice. In the case of
general-purpose processors, in contrast, the HP achieves results similar to
those of the SP.


5  Performance  analysis 

From the previous section we can conclude that the MediaBench suite exhibits a
higher value predictability than SPEC95. However, to justify the use of the
extra value prediction hardware, it is essential to show that the processor
performance is significantly improved.

In  this  section  we  evaluate  the  achievable  speedup  from  using  value  prediction  in 
two  typical  processor  architectures:  a  high-performance  embedded  processor,  and  a 
high-performance  general  purpose  processor. 


5.1  Machine  model 


A  detailed  description  of  all  the  hardware  mechanisms  involved  in  the  value 
speculation  technique  is  beyond  the  scope  of  the  present  work.  We  just  want  to  briefly 
introduce  the  architecture  employed  in  the  timing  simulations,  which  is  explained  in 
more  detail  in  the  Technical  Report  [16]. 


Fig. 7. Architecture Block Diagram (fetch with a perfect fetch mechanism, value
predictor and its updating, commit logic).

Our baseline architecture, shown in figure 7, is derived from the architecture
used by the SimpleScalar Out-of-Order simulator [13]. This architecture is
based on the Register Update Unit (RUU) [17], which is a scheme that unifies
the instruction window, the rename logic, and the reorder buffer under the same
structure.

Predictor Lookup. The value predictor is accessed in parallel with the
instruction fetch using the addresses of the instructions fetched in each
cycle, and it provides the predicted output values (if available) of these
instructions.

Scheduling policy. The scheduling policy first issues the instructions with
actual operands, so instructions with predicted or speculative operands are
issued later. Within each group, an oldest-instruction-first policy is used.
Using this policy, speculative instructions are not issued while there are
enough non-speculative instructions ready to execute, even if these
non-speculative instructions are newer than the speculative ones.
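The selection rule can be sketched as a two-key sort. This is only an illustration of the ordering described above: the `ReadyInstr` record and `select_for_issue` function are hypothetical names, and functional-unit and memory-port constraints are ignored.

```python
from dataclasses import dataclass


@dataclass
class ReadyInstr:
    seq: int           # program order; a lower number means an older instruction
    speculative: bool  # True if any source operand is predicted or speculative


def select_for_issue(ready, issue_width):
    """Pick up to issue_width ready instructions.

    Non-speculative instructions go first; within each group the oldest
    instruction goes first.
    """
    ordered = sorted(ready, key=lambda ins: (ins.speculative, ins.seq))
    return ordered[:issue_width]
```

With ready instructions 1 (speculative), 2 (speculative), 3 and 5 (both non-speculative) and an issue width of 3, the selection is 3, 5, 1: the newer non-speculative instructions are issued ahead of the older speculative ones.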

Validation  and  misprediction  recovery.  The  process  of  validation/invalidation  of 
speculative  instructions  is  performed  during  write-back.  This  process  is  performed  in 
parallel,  i.e.  all  the  instructions  within  a  dependence  chain  can  be 
validated/invalidated  in  a  single  cycle.  The  instructions  whose  operands  have  been 
validated  can  commit  in  the  next  stage.  On  the  other  hand,  those  instructions  whose 
operands  have  been  invalidated  must  be  re-executed.  In  view  of  the  fact  that  it  is  not 
possible  to  check  the  validity  and  re-schedule  the  invalidated  instructions  in  the  same 
cycle,  it  is  obvious  that  these  instructions  cannot  be  re-executed  in  the  next  cycle. 
Consequently,  they  are  delayed  one  cycle  in  relation  to  normal  execution. 


Baseline architectures. Table 5 shows the main parameters of the two selected
architectures: a 4-wide embedded processor architecture and a 6-wide
general-purpose architecture. Most of the parameters of these architectures
(fetch/decode width, issue width, instruction window and L1-cache size) have
been taken from two highly evolved, representative commercial processors: the
AMD K6-2E embedded processor core [18], and the AMD Athlon general-purpose
processor core [19] (notice that fetch/decode width refers to RISC
instructions). Other parameters, like functional units, have been adapted to
the SimpleScalar simulator, which does not support special instructions (like
MMX or 3DNow!). Furthermore, since value prediction significantly increases the
pressure on execution units, the number of functional units and memory ports
has been slightly increased in order to avoid a bottleneck in the execution
stage.

Table 5. Architectural Parameters.

Configuration parameters     Embedded Processor  General-purpose Processor
Fetch/decode width           4                   6
Issue width                  6                   9
Instruction window           24                  72
Load Store Queue             12                  36
# Integer ALU                4                   6
# Integer Multiplier         1                   2
# Floating Point ALU         4                   6
# Floating Point Multiplier  1                   2
# Memory Ports               2                   3
L1 I Cache / L1 D Cache      32KB / 32KB         64KB / 64KB
L1 Latency                   1                   1
L2 Cache Size                No                  4MB
L2 Latency                                       6
Memory Latency               10                  10


5.2  Comparative  results

In  the  previous  section  we  concluded  that  the  hybrid  predictor  exhibits  the  best 
cost/efficacy  trade-off.  This  observation,  along  with  the  fact  that  detailed  timing 
simulations  take  a  long  time  to  execute,  led  us  to  use  only  the  hybrid  predictor  to 
show  performance  results. 

Figure 8 shows the percentage of speedup achievable for both architectures
(embedded and general purpose) and both benchmark suites (MediaBench and SPEC)
using the hybrid predictor with various cost configurations (under 16 KBytes),
and using a 2-bit saturating counter for confidence estimation with a
confidence threshold equal to 3. Both the average and the realistic mean
(eliminating the best and the worst cases) are displayed in this figure.
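The confidence mechanism can be sketched as follows. The 2-bit width and the threshold of 3 come from the text; the update rule (increment on a correct prediction, decrement on a wrong one) and the initial value of 0 are assumptions made for this illustration.

```python
class ConfidenceCounter:
    """2-bit saturating counter; a prediction is used only at or above the threshold."""

    MAX = 3  # largest value representable in 2 bits

    def __init__(self, threshold=3):
        self.value = 0  # assumption: entries start with no confidence
        self.threshold = threshold

    def confident(self):
        return self.value >= self.threshold

    def update(self, correct):
        # Assumption: +1 on a correct prediction, -1 on a wrong one, saturating.
        if correct:
            self.value = min(self.MAX, self.value + 1)
        else:
            self.value = max(0, self.value - 1)
```

With a threshold of 3, an entry must be correct three times in a row from reset before its prediction is used, and under the assumed update rule a single misprediction suspends speculation for that entry until the next correct outcome.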

Two  main  conclusions  can  be  drawn  from  this  figure.  First,  the  predictability 
results  shown  in  the  previous  section  have  a  direct  equivalence  in  the  performance 
results,  since  the  speedup  obtained  for  the  MediaBench  suite,  for  both  architectures 
and  all  the  predictor  configurations,  is  higher  than  the  speedup  reached  for  the  SPEC 
suite.  Second,  the  difference  between  the  average  and  the  realistic  mean  curves  for  the 
SPEC  benchmarks  is  much  more  prominent  than  the  difference  between  the 
predictability curves shown in the previous section. Therefore, the sensitivity
of the SPEC suite to the extreme cases has an even higher impact on speedup.
This behavior
is  mainly  due  to  the  irregular  results  obtained  for  the  m88ksim  benchmark,  which 
achieves  a  much  higher  speedup  than  the  other  benchmarks.  On  the  other  hand, 
MediaBench  benchmarks  exhibit  a  much  more  regular  behavior,  since  the  difference 
between  the  average  and  the  realistic  mean  curves  is  of  little  significance. 


Fig. 8. % Speedup achieved with the hybrid value predictor, as a function of
predictor cost (Kbytes): a) Embedded Processor, b) General Purpose Processor.
Curves: MB Average, MB Realistic, SPEC Average, SPEC Realistic.


Figure  9  highlights  the  differences  in  the  speedup  achieved  by  using  value 
prediction  in  the  two  habitual  working  situations:  the  embedded  processor  running 
MediaBench-like  applications,  and  the  general-purpose  processor  running  SPEC-like 
applications. 

Despite the general-purpose processor having wider fetch and issue, together
with a larger window, the embedded processor obtains a significantly higher
speedup through value prediction. These results reveal two important facts.
First, as we have shown throughout this paper, typical applications of embedded
systems, like the MediaBench benchmarks, exhibit a higher value predictability
than general-purpose applications. Second, as shown in [16], the value
prediction technique gets better speedup results when the processor uses a
small to medium size instruction window. The explanation of this effect is
simple. With a small to medium window size, the number of independent
instructions kept in the window is not enough to cover the available issue
bandwidth; hence value prediction can be efficiently exploited, because it
allows data dependencies to be broken and a good number of dependent
instructions to be issued in parallel. However, as the window enlarges, the
number of independent instructions kept in the window also increases, and value
prediction becomes less useful, since it is easier to find enough independent
instructions in the window to feed the issue bandwidth. In view of this fact,
embedded processors can benefit more from value prediction than general-purpose
processors, because they usually employ smaller windows due to area
restrictions (24 and 72 entries, respectively, in our architectures).

Fig. 9. Speedup achievable in habitual working situations (realistic mean), as
a function of predictor cost (Kbytes).

A question often asked about value prediction is whether the extra prediction
hardware could be better employed in other parts of the processor, where it
could yield a higher benefit in overall performance, for example by increasing
the L1-cache size. With this idea in mind we performed some experiments, whose
results are displayed in Figure 10. This figure shows the speedup obtained by
doubling the L1-cache (both the instruction and data caches) in the embedded
processor and in the general-purpose processor (both processors running the
MediaBench benchmarks), compared to the speedup obtained by using a 14 Kbyte
hybrid value predictor.


Fig. 10. Speedup obtained by doubling the L1-cache and by using value
prediction (hybrid): a) Embedded processor, b) General-purpose processor.

We can observe that, for both processors, the speedup achievable using value
prediction is much higher than that obtained by increasing the cache size. This
difference is more prominent in the general-purpose processor (when executing
MediaBench), since increasing the cache scarcely affects performance.
Furthermore, the cost of the prediction hardware (14 Kbytes) is much lower than
the cost of doubling the L1-cache (64 Kbytes in the embedded processor and 128
Kbytes in the general-purpose processor). So, we can conclude that value
prediction is a profitable hardware investment for processor performance.


6  Conclusions  and  Future  Work 

The  objective  of  this  work  is  to  apply  value  prediction  techniques  in  the  ambit  of 
embedded  processors  and  to  demonstrate  their  higher  efficiency  within  this  scope. 
The  main  conclusions  that  can  be  drawn  from  this  study  are  the  following: 

•  Our initial intuition was verified and we have demonstrated that multimedia
and communication programs present a more highly predictable value behavior
than normal programs. Furthermore, a high degree of predictability can be
obtained using low-cost value predictors, and therefore employing value
prediction seems reasonable for this particular kind of application.

•  By  means  of  detailed  timing  simulations,  and  using  two  generic  high-performance 
architectures,  one  for  an  embedded  processor  and  another  for  a  general  purpose 
processor,  we  have  shown  that  the  higher  predictability  of  multimedia  and 
communication  programs  has  a  direct  impact  on  the  performance  results,  since  the 
speedup  obtained  for  the  MediaBench  suite,  for  both  architectures  and  all  the 
predictor  configurations  is  higher  than  the  speedup  attained  for  the  SPEC  suite. 

•  In spite of the general-purpose processor having a wider fetch and issue, as
well as a larger window, the speedup achievable using value prediction in an
embedded environment is significantly higher. This is due to both the higher
value predictability of multimedia and communication applications and the
smaller instruction window used in embedded processors, which allows more
efficient exploitation of value prediction.

•  Finally, we have shown that the speedup obtained by using a hybrid value
predictor is appreciably higher than the speedup obtained by doubling the
L1-cache. These results show that the hardware invested in value prediction is
a beneficial expense for processor performance.

Nevertheless,  this  work  must  be  interpreted  as  a  first  step  towards  integrating  value 
speculation  into  embedded  processor  architecture.  We  believe  that  there  is 
considerable  work  to  be  carried  out,  especially  in  relation  to  performance/cost 
analysis,  power-consumption  considerations,  and  confidence  estimation.  Our  future 
research  will  cover  these  issues,  and  also  deepen  the  analysis  of  the  hardware 
mechanisms involved in value speculation.


Abstract. We present a parallelization of Petkov, Christov, and Konstantinov's
algorithm for the pole assignment problem of single-input systems. Our new
implementation is specially appropriate for current high performance processors
and shared memory multiprocessors and obtains a high performance by reordering
the access pattern, while maintaining the same numerical properties.

The experimental results on two different platforms (SGI PowerChallenge and
SUN Enterprise) report a higher performance of the new implementation over
traditional algorithms.

Topics:  Numerical  methods,  parallel  and  distributed  algorithms. 


1  Introduction 

Consider the continuous, time-invariant linear system defined by

$$\dot{x}(t) = A x(t) + B u(t), \qquad x(0) = x_0,$$

with $n$ states, in vector $x(t)$, and $m$ inputs, in vector $u(t)$. Here, $A$
is the $n \times n$ state matrix, and $B$ is the $n \times m$ input matrix.

In the design of linear control systems, $u(t)$ is used to control the
behaviour of the system. Specifically, the control

$$u(t) = -F x(t),$$

where $F$ is an $m \times n$ feedback matrix, is used to modify the properties
of the closed-loop system

$$\dot{x}(t) = (A - BF) x(t).$$

The problem of finding an appropriate feedback $F$ is referred to as the
problem of synthesis of a state regulator [11]. In some applications, e.g., for
asymptotic stability [4,11], $F$ can be chosen so that the eigenvalues of the
closed-loop matrix are in the open left-half complex plane.

*  Supported by the Conselleria de Cultura, Educación y Ciencia de la
Generalidad Valenciana GV99-59-1-14 and the Fundacio Caixa-Castello Bancaixa.

In this paper we are interested in the pole assignment problem of single-input
systems ($m = 1$ and $B = b$ is a vector), or PAPSIS, which consists in the
determination of a feedback vector $F = f$, such that the poles of the
closed-loop system are allocated to a pre-specified set
$\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ [4]. This problem has a
solution (unique in the single-input case) if and only if the system is
controllable [15]. We assume hereafter that this condition is satisfied.
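To make the problem statement concrete, the sketch below solves PAPSIS for a 2 x 2 system by matching characteristic polynomial coefficients. This is not the PCK algorithm and is not a numerically recommended method; it assumes the pair is given in companion form, A = [[0, 1], [-a0, -a1]] and b = (0, 1)^T, which is a different canonical form from the controller Hessenberg form used in this paper. All function names are illustrative.

```python
import cmath


def assign_poles_companion_2x2(a0, a1, poles):
    """Feedback f = (f1, f2) placing the two given poles.

    Assumes A = [[0, 1], [-a0, -a1]] and b = (0, 1)^T (companion form).
    """
    p1, p2 = poles
    # Desired characteristic polynomial: s^2 + d1*s + d0.
    d1, d0 = -(p1 + p2), p1 * p2
    # A - b f = [[0, 1], [-(a0 + f1), -(a1 + f2)]], so match the coefficients.
    return (d0 - a0, d1 - a1)


def closed_loop_poles(a0, a1, f):
    """Eigenvalues of A - b f for the same companion-form pair."""
    c0, c1 = a0 + f[0], a1 + f[1]
    disc = cmath.sqrt(c1 * c1 - 4 * c0)
    return sorted([(-c1 - disc) / 2, (-c1 + disc) / 2], key=lambda z: z.real)
```

For a0 = 2 and a1 = 3 (open-loop poles -1 and -2) and the target set {-4, -5}, the feedback is f = (18, 6), and the closed-loop eigenvalues are -5 and -4, as requested.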

A  survey  of  existing  algorithms  for  the  pole  assignment  problem  can  be 
found,  e.g.,  in  [4-6,11,14].  Among  these,  methods  based  on  the  Schur  form  of 
the  closed-loop  state  matrix  [6,9,10]  are  numerically  stable  [3,7]. 

In [2] we apply block-partitioned techniques to obtain efficient
implementations of Miminis and Paige's algorithm for PAPSIS [6]. In this paper
we apply similar techniques to obtain LAPACK-like [1] block-partitioned
variants and parallel implementations of Petkov, Christov, and Konstantinov's
algorithm (hereafter, PCK) [10] for PAPSIS.

We assume the system to be initially in unreduced controller Hessenberg form
[13]. This reduction can be carried out by means of efficient blocked
algorithms based on (rank-revealing) orthogonal factorizations [12].

Our  algorithms  are  specially  designed  to  provide  a  better  use  of  the  cache 
memory,  while  maintaining  the  same  numerical  properties.  The  experimental 
results  on  SGI  Power  Challenge  and  SUN  Enterprise  multiprocessors  report 
the  performance  of  our  block-partitioned  serial  and  parallel  algorithms. 


2  The  sequential  PCK  algorithm 


Consider  the  controllable  single-input  system  in  controller  Hessenberg  form 
defined  by  (A,b),  with  real  entries, 


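\[
(b \mid A) =
\left[\begin{array}{c|cccc}
\beta_1 & \alpha_{11} & \cdots & \alpha_{1,n-1} & \alpha_{1n} \\
        & \alpha_{21} & \cdots & \alpha_{2,n-1} & \alpha_{2n} \\
        &             & \ddots & \vdots         & \vdots      \\
        &             &        & \alpha_{n,n-1} & \alpha_{nn}
\end{array}\right].
\]

As the system is controllable, it can be shown that
$\beta_1, \alpha_{21}, \ldots, \alpha_{n,n-1} \neq 0$ [13].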

The PCK algorithm is based on orthogonal transformations of the eigenvectors
and proceeds as follows. (For simplicity we only describe the algorithm
for pole assignment of real eigenvalues.) Let $\lambda \in \mathbb{R}$ and
$v \in \mathbb{R}^n$ be, respectively, an eigenvalue and its corresponding
eigenvector of the closed-loop matrix $A - bf$. Let $Q$ be an orthogonal
matrix such that $Qv = (\tilde v_1, 0, \ldots, 0)^T$.
This matrix can be constructed so that $Q^T A Q$ and $Q^T (A - bf) Q$ are in
Hessenberg form. Furthermore,

\[ Q^T (A - bf) Q e_1 = (\lambda, 0, \ldots, 0)^T, \qquad (2) \]
where $e_1$ is the first column of the identity matrix. Solving (2), we find
the first element of the transformed feedback $\tilde f = fQ$ from the
corresponding elements of $Q^T A Q$ and $Q^T b$. After this stage, the
procedure is repeated with the lower trailing blocks of order $n - 1$ of the
transformed matrices to assign a new pole. By proceeding recursively we obtain
$\tilde f$, and $f = \tilde f Q^T$. The procedure for assigning
$\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ can be roughly stated
as follows.


for i = 1, ..., n-1
    Set v_n = 1 and compute v_{n-1}
    for j = n-1, n-2, ..., i
        Compute v_{j-1}
        Construct a Givens rotation R_{i,j+1} in R^{n x n} such that
            (v_i, ..., v_j, v_{j+1}, 0, ..., 0) R_{i,j+1} = (v_i, ..., v_j, 0, ..., 0)
        Apply the transformation A = R_{i,j+1} A R_{i,j+1}^T
    end for
    Apply the transformation b = R_{i,i+1} b
    Compute f_i = a_{i+1,i} / b_{i+1}
end for
Compute f_n = (a_{n,n} - lambda_n) / b_n

At  each  iteration  of  the  outer  loop  a  new  pole  is  assigned.  In  the  inner  loop, 
at  each  iteration  we  compute  a  component  of  eigenvector  v  (j  -  1),  obtain  a 
transformation  to  introduce  a  zero  in  a  component  of  the  eigenvector  (j  + 1), 
and  finally  apply  this  transformation  on  the  system  matrix. 
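
As an illustration of the elementary operation used in the inner loop, the following minimal Python sketch builds a plane (Givens) rotation that zeroes the second entry of a pair of a row vector. The function names `givens` and `apply_right` are hypothetical helpers chosen for this example; the sketch is not the authors' Fortran-77 implementation and only shows the 2-by-2 kernel.

```python
import math

def givens(a, b):
    """Return (c, s, r) with (a, b) * [[c, -s], [s, c]] = (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0, 0.0  # nothing to annihilate
    return a / r, b / r, r

def apply_right(v, j, c, s):
    """Return the row vector v * R, where R rotates components j and j+1 (0-based)."""
    w = list(v)
    w[j] = c * v[j] + s * v[j + 1]
    w[j + 1] = -s * v[j] + c * v[j + 1]
    return w

# Zero the last nonzero entry of a short row vector.
v = [3.0, 4.0]
c, s, r = givens(v[0], v[1])
w = apply_right(v, 0, c, s)
```

The same coefficients `c` and `s` would then be applied to two rows and two columns of the system matrix, which is why each inner iteration touches only a pair of adjacent rows and columns.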

3  Parallelization  of  the  PCK  algorithm 

In traditional implementations of this algorithm each transformation matrix
R_{i,j+1} is applied immediately after it is computed. Thus, at each iteration
of loop j, two rows and columns (the j-th and the (j+1)-th) of the matrix are
referenced.

Our block-partitioned algorithms reduce the number of data references
by delaying the update of some entries of the matrix. Thus, we work on the
transformed lower Hessenberg matrix A^T, partition this matrix by blocks of
columns (see figure 1), and delay the application of transformations from
the left until the proper block is referenced. Although the parameters of the
delayed transformations need to be stored, the dimension of this work space
is small.

Fig. 1. Partition of the matrix by blocks of columns.

Specifically, consider the assignment of the first pole in the
block-partitioned algorithm:

- A set of transformations is computed to shift up the pole, until it
disappears at the top left corner of block B1, and the transformations are
only applied to B1. The application of this update from the left to blocks
B2, ..., B6 is delayed.

- The procedure continues with block B2. First, the delayed update is
applied from the left to B2. Then, a new set of transformations is computed
to shift up the pole, and these transformations are only applied to B2.
(The application of this update from the left to blocks B3, ..., B6 is
delayed.)

- The procedure is repeated with blocks B3, B4, B5, and B6, until the pole
is assigned and the problem is deflated.
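
The delayed update described in the steps above relies on the fact that a transformation applied from the left acts on each column independently. The following Python sketch illustrates only this point: it applies a stored list of rotations either immediately to all columns or block by block, and both orders give the same matrix. The helper names and the fixed rotation list are assumptions made for the example; in the actual algorithm the rotations for each block are computed from the already updated block.

```python
def rotate_rows(M, j, c, s, col_lo, col_hi):
    """Rotate rows j and j+1 of M in place, restricted to columns col_lo..col_hi-1."""
    for k in range(col_lo, col_hi):
        a, b = M[j][k], M[j + 1][k]
        M[j][k] = c * a + s * b
        M[j + 1][k] = -s * a + c * b

def immediate(M, rots):
    """Apply every rotation to all columns as soon as it is available."""
    A = [row[:] for row in M]
    for j, c, s in rots:
        rotate_rows(A, j, c, s, 0, len(A[0]))
    return A

def delayed(M, rots, nb):
    """Apply the stored rotations one block of nb columns at a time."""
    A = [row[:] for row in M]
    n_cols = len(A[0])
    for lo in range(0, n_cols, nb):
        hi = min(lo + nb, n_cols)
        for j, c, s in rots:  # the rotation parameters are the only extra storage
            rotate_rows(A, j, c, s, lo, hi)
    return A

M = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]
rots = [(2, 0.6, 0.8), (1, 0.8, -0.6), (0, 0.0, 1.0)]
same = immediate(M, rots) == delayed(M, rots, nb=2)
```

Working on one block of columns at a time keeps the referenced data small, which is the cache benefit the block-partitioned variants aim for.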


In the parallel algorithm we are interested in a higher (and coarser) degree
of parallelism than that achieved with the application of a single
transformation. Notice that in each iteration of the inner loop j two
rows and two columns of the matrix are modified. Thus, as soon as j = n-4,
it would be possible to start the assignment of a different pole.

This is a pipelined algorithm. Specifically, the assignment of a new pole can
be started as soon as the transformations related to the previous pole no
longer affect the last block of columns. Thus, it is possible to assign in
parallel as many poles as there are blocks in the partition of A^T.

In our algorithm, the maximum number of pipelined stages is n/nb, where n
and nb are the problem size and block size respectively. Figure 2 shows the
evolution of the different stages in our pipelined algorithm. As the problem
is deflated, the number of blocks of columns (and therefore the number
of pipelined stages) decreases. In practice, nb must be larger than three;
otherwise, the stages cannot be correctly pipelined.
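
The stage count can be made concrete with a small Python sketch. It assumes that a partial last block still counts as a stage (hence the ceiling) and that each assigned pole deflates the problem by one; both are assumptions of this example, and the function name `pipeline_stages` is hypothetical.

```python
import math

def pipeline_stages(n, nb, assigned=0):
    """Number of column blocks (pipeline stages) left after `assigned` poles."""
    if nb <= 3:
        raise ValueError("nb must be larger than three for correct pipelining")
    remaining = n - assigned
    return max(0, math.ceil(remaining / nb))

stages_start = pipeline_stages(1000, 100)        # full problem
stages_later = pipeline_stages(1000, 100, 250)   # after 250 poles are assigned
```

Since the count shrinks as poles are assigned, fewer processors can be kept busy towards the end of the computation.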


4  Experimental  Results 

Fig. 2. Evolution of the pipelined algorithm.

In this section we report the results of our numerical experiments on an SGI
PowerChallenge (SGI MIPS R10000) and a SUN Enterprise 4000 (SUN UltraSPARC)
multiprocessor. All our experiments were performed using IEEE
double-precision arithmetic and Fortran-77 ($\varepsilon \approx 2.2 \times 10^{-16}$).
In our implementations we have employed orthogonal transformations based on
Givens rotations. The system pair (A, b) was generated so that the
computation of the feedback matrix was well-conditioned.

We  have  developed  the  following  pole-assignment  algorithms: 

-  BPAPSIS:  Block-partitioned  algorithm. 

-  PPAPSIS:  Parallel  version  of  the  block-partitioned  algorithm. 

Figure 3 shows the speed-up of our block-partitioned algorithm for different
block dimensions and problem sizes, nb and n respectively. We tested systems
of moderate size, from 100 to 1000, using block sizes nb = 1 (non-blocked
algorithm), 32, 64 and 100 for the SGI MIPS R10000 processor, and
nb = 1, 16, 32 and 64 for the SUN UltraSPARC processor. The results are
averaged over 5 executions on different random matrices. In all the
experiments the blocked implementations clearly outperform the sequential
code (nb = 1), except on the SGI MIPS R10000 when the problem size is small
(n < 200).


Fig. 3. Speed-up of the block-partitioned algorithm on the SGI MIPS R10000 (left)
and the SUN UltraSPARC (right) processors.


Figure 4 shows the efficiency of our parallel algorithm compared with the
non-blocked and blocked algorithms using np = 2, 4, ..., 12 processors. These
figures report the efficiency versus problem size on the SGI PowerChallenge
and SUN Enterprise platforms. The blocked and parallel algorithms employ
the optimal block size determined in the previous experiment, i.e., nb =
100 and nb = 32 for SGI and SUN, respectively. As these figures show, if
we compare our parallel algorithms with the serial (non-blocked) algorithm,
efficiencies higher than 1 are obtained. On the other hand, if the parallel
algorithm is compared with the blocked algorithm, the maximum efficiency is
80% and it decreases as the number of processors of the system is increased,
since the problem size is moderate.
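
The text does not spell out how efficiency is computed; the sketch below assumes the usual definition, speed-up divided by the number of processors, and uses made-up timings purely to show how an efficiency above 1 can arise when the reference is the non-blocked code. The function names and all numbers are hypothetical.

```python
def speedup(t_reference, t_parallel):
    """Ratio of reference time to parallel time."""
    return t_reference / t_parallel

def efficiency(t_reference, t_parallel, n_proc):
    """Speed-up per processor (assumed definition)."""
    return speedup(t_reference, t_parallel) / n_proc

# Hypothetical timings in seconds, not measurements from the paper.
t_nonblocked, t_blocked, t_parallel, n_proc = 10.0, 4.0, 1.25, 4
eff_vs_nonblocked = efficiency(t_nonblocked, t_parallel, n_proc)
eff_vs_blocked = efficiency(t_blocked, t_parallel, n_proc)
```

With these illustrative numbers the parallel code inherits the cache gain of blocking, so it looks super-linear against the non-blocked reference while staying below 1 against the blocked one.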


Fig. 4. Efficiency of the parallel algorithm compared with the non-blocked
algorithm (top) and the blocked algorithm (bottom) on the SGI PowerChallenge
(left) and the SUN Enterprise 4000 (right) multiprocessors.
5 Conclusions

We have presented block-partitioned and parallel versions of Petkov,
Christov, and Konstantinov's algorithm for the pole assignment problem of
single-input systems. Our block-partitioned algorithms achieve a high
speed-up on SGI and SUN processors, while maintaining the same numerical
properties.

The experimental results of the parallel algorithms also show an important
increase in performance on the SGI PowerChallenge and SUN Enterprise
platforms.
Abstract. Parallel algorithms for solving nonlinear systems are studied.
Non-stationary parallel algorithms based on the Newton method are
considered. Convergence properties of these methods are studied when
the matrix in question is either monotone or an H-matrix. In order to
illustrate the behavior of these methods, we implemented these algorithms
on two distributed memory multiprocessors. The first platform is an
Ethernet network of five 120 MHz Pentiums. The second platform is an IBM
RS/6000 with 8 nodes. Several versions of these algorithms are tested.
Experiments show that these algorithms can solve the nonlinear system
in substantially less time than the current (stationary or non-stationary)
parallel nonlinear algorithms based on the multisplitting technique.

Topics.  Numerical  methods,  parallel  and  distributed  algorithms. 

1  Introduction 

Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be a nonlinear function. We are
interested in the parallel solution of the system of nonlinear equations

\[ F(x) = 0, \qquad (1) \]

where it is assumed that a solution $x^*$ exists. We suppose that there exists
an $r_0 > 0$ such that

(i) $F$ is differentiable on $S_0 = \{x \in \mathbb{R}^n : \|x - x^*\| < r_0\}$,

(ii) the Jacobian matrix at $x^*$, $F'(x^*)$, is nonsingular,

(iii) there exists an $L > 0$ such that for $x \in S_0$, $\|F'(x) - F'(x^*)\| \le L \|x - x^*\|$.

Under assumptions (i)-(iii), a well-known method for solving the nonlinear
system (1) is the classical Newton method (cf. [11]). Given an initial vector
$x^{(0)}$, this method produces the following sequence of vectors

\[ x^{(\ell+1)} = x^{(\ell)} - x^{(\ell+1/2)}, \quad \ell = 0, 1, \ldots, \qquad (2) \]

*  This  research  was  supported  by  Spanish  DGESIC  grant  number  PB98-0977. 
where $x^{(\ell+1/2)}$ is the solution of the linear system

\[ F'(x^{(\ell)}) z = F(x^{(\ell)}). \qquad (3) \]
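
To make the Newton step concrete, here is a minimal Python sketch for a two-dimensional example: each iteration solves a linear system with the Jacobian and subtracts the solution from the current iterate. The test function `F`, its Jacobian `dF`, the starting point and the direct 2-by-2 solver `solve2` are all assumptions of this illustration and do not come from the paper.

```python
def solve2(J, r):
    """Solve the 2x2 linear system J z = r by Cramer's rule."""
    (a, b), (c, d) = J
    det = a * d - b * c
    return [(d * r[0] - b * r[1]) / det, (a * r[1] - c * r[0]) / det]

def newton(F, dF, x, steps=20, tol=1e-12):
    """Classical Newton iteration: solve F'(x) z = F(x), then set x <- x - z."""
    for _ in range(steps):
        z = solve2(dF(x), F(x))
        x = [x[0] - z[0], x[1] - z[1]]
        if max(abs(z[0]), abs(z[1])) < tol:
            break
    return x

def F(x):
    # Hypothetical test system: circle of radius 2 intersected with the line x0 = x1.
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]]

def dF(x):
    return [[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]]

root = newton(F, dF, [1.0, 2.0])
```

In a Newton iterative method the call to `solve2` would be replaced by a few sweeps of an iterative linear solver, which is the setting studied in the rest of the paper.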

On the other hand, if we use an iterative method to approximate the solution
of (3) we are in the presence of a Newton iterative method; see e.g., [11] and
[12]. In order to generate efficient algorithms to solve the nonlinear system
(1) on a parallel computer, White [14] defines the parallel Newton-SOR method,
which generalizes a particular Newton iterative method, the Newton-SOR method.
In [14], White also introduces a parallel nonlinear Gauss-Seidel algorithm for
approximating the solution of an almost linear system, that is, to solve (1)
when $F(x) = Ax + \Phi(x) - b$, where $A = (a_{ij})$ is a real $n \times n$
matrix, $x$ and $b$ are $n$-vectors and $\Phi : \mathbb{R}^n \to \mathbb{R}^n$
is a nonlinear diagonal mapping (i.e., the $i$th component $\Phi_i$ of $\Phi$
is a function only of $x_i$). Bai [2] has generalized the parallel nonlinear
Gauss-Seidel algorithm in the context of relaxed methods. Both methods are
based on the use of the multisplitting technique (see [10]). On the other
hand, Bru, Elsner and Neumann [4] studied two non-stationary methods
(synchronous and asynchronous) based on the multisplitting method for solving
linear systems in parallel. As can be seen e.g., in [6] and [9],
non-stationary algorithms behave better than the multisplitting method.
Recently, in [1] we have extended the idea of the non-stationary methods to
the problem of solving an almost linear system. These methods are a
generalization of the parallel nonlinear Gauss-Seidel algorithm [14] and the
parallel nonlinear AOR method [2].

In this paper we construct a parallel Newton iterative algorithm to solve the
general nonlinear system (1) that uses non-stationary multisplitting models to
approximate linear system (3). For this purpose, let us consider for each $x$
a multisplitting of $F'(x)$, $\{M_k(x), N_k(x), E_k\}_{k=1}^{p}$, that is, a
collection of splittings

\[ F'(x) = M_k(x) - N_k(x), \quad 1 \le k \le p, \qquad (4) \]


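and diagonal nonnegative weighting matrices $E_k$ which add to the identity.
Let us further consider a sequence of integers $q(\ell, s, k)$,
$\ell = 0, 1, 2, \ldots$, $s = 1, 2, \ldots, m_\ell$, $1 \le k \le p$, called
non-stationary parameters. Following [4] or [9] the linear system (3) can be
approximated by $x^{(\ell+1/2)}$ as follows

\[ x^{(\ell+1/2)} = z^{(m_\ell)}, \]

\[ z^{(s)} = H_{\ell,s}(x^{(\ell)}) z^{(s-1)} + B_{\ell,s}(x^{(\ell)}) F(x^{(\ell)}), \quad s = 1, 2, \ldots, m_\ell, \]

\[ H_{\ell,s}(x) = \sum_{k=1}^{p} E_k \left( M_k^{-1}(x) N_k(x) \right)^{q(\ell,s,k)}, \qquad (5) \]

\[ B_{\ell,s}(x) = \sum_{k=1}^{p} E_k \sum_{j=0}^{q(\ell,s,k)-1} \left( M_k^{-1}(x) N_k(x) \right)^{j} M_k^{-1}(x) \qquad (6) \]

\[ \phantom{B_{\ell,s}(x)} = \sum_{k=1}^{p} E_k \left( I - \left( M_k^{-1}(x) N_k(x) \right)^{q(\ell,s,k)} \right) \left( F'(x) \right)^{-1}, \qquad (7) \]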
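and $z^{(0)} = 0$. Thus

\[ x^{(\ell+1/2)} = \left( \sum_{i=1}^{m_\ell - 1} \prod_{j=i+1}^{m_\ell} H_{\ell,j}(x^{(\ell)}) B_{\ell,i}(x^{(\ell)}) + B_{\ell,m_\ell}(x^{(\ell)}) \right) F(x^{(\ell)}), \]

where $\prod_{j=i+1}^{m_\ell} H_{\ell,j}(x^{(\ell)})$ denotes the product of the
matrices $H_{\ell,j}(x^{(\ell)})$ in the order
$H_{\ell,m_\ell}(x^{(\ell)}) H_{\ell,m_\ell-1}(x^{(\ell)}) \cdots H_{\ell,i+1}(x^{(\ell)})$.
Therefore, from (2) the non-stationary parallel Newton iterative method can be
written as follows.

\[ x^{(\ell+1)} = G_{\ell,m_\ell}(x^{(\ell)}), \qquad (8) \]

where

\[ G_{\ell,m_\ell}(x) = x - A_{\ell,m_\ell}(x) F(x) \]

and

\[ A_{\ell,m_\ell}(x) = \sum_{i=1}^{m_\ell - 1} \prod_{j=i+1}^{m_\ell} H_{\ell,j}(x) B_{\ell,i}(x) + B_{\ell,m_\ell}(x). \qquad (9) \]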

We note that the formulation of this method allows us to use a different
number of local iterations $q(\ell, s, k)$ not only in each processor $k$ and
at each nonlinear iteration $\ell$ but also at each linear iteration $s$.
Moreover, this method extends the parallel Newton method introduced by
White [14].
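
The inner recursion $z^{(s)} = H z^{(s-1)} + B F(x)$ can be sketched in Python for a fixed matrix and right-hand side. The sketch assumes a Gauss-Seidel-type splitting restricted to disjoint index blocks (the kind of multisplitting used later in the experiments), 0/1 weighting matrices $E_k$, and a small made-up test system; the function names are hypothetical and the loops run sequentially rather than on separate processors.

```python
def local_iterations(A, b, z, block, q):
    """q iterations y <- M_k^{-1}(N_k y + b), with M_k = D - L_k and L_k the
    strictly lower part of A restricted to the index set `block`."""
    n = len(z)
    in_block = set(block)
    y = list(z)
    for _ in range(q):
        new = [0.0] * n
        for i in range(n):
            acc = b[i]
            for j in range(n):
                if j == i:
                    continue
                # Inside the block, already updated entries (j < i) are reused.
                use_new = i in in_block and j in in_block and j < i
                acc -= A[i][j] * (new[j] if use_new else y[j])
            new[i] = acc / A[i][i]
        y = new
    return y

def multisplitting_step(A, b, z, blocks, q):
    """One step z <- H z + B b; E_k keeps the block-k entries of the k-th result."""
    out = list(z)
    for k, block in enumerate(blocks):
        y = local_iterations(A, b, z, block, q[k])  # independent per processor k
        for i in block:
            out[i] = y[i]
    return out

# Made-up diagonally dominant system whose solution is (1, 1, 1, 1).
A = [[4.0, -1.0, 0.0, -1.0], [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0], [-1.0, 0.0, -1.0, 4.0]]
b = [2.0, 2.0, 2.0, 2.0]
z = [0.0] * 4
for s in range(30):
    z = multisplitting_step(A, b, z, blocks=[[0, 1], [2, 3]], q=[2, 3])
```

Letting each block use its own number of local iterations (`q=[2, 3]` here) is what makes the scheme non-stationary; in the Newton setting `A` and `b` would be the Jacobian and the residual at the current nonlinear iterate.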

In the following section we analyze the convergence properties of this
algorithm when the Jacobian matrix is monotone or an H-matrix. Section 3
contains some numerical experiments, which illustrate the performance of the
algorithms studied, on an Ethernet network of five 120 MHz Pentiums and on an
IBM RS/6000 SP. In the rest of this section we present some notation,
definitions and preliminary results used in the paper.

A matrix $A$ is said to be a nonsingular M-matrix if $A$ has all nonpositive
off-diagonal entries and it is monotone, i.e., $A^{-1} \ge O$. For any matrix
$A = (a_{ij}) \in \mathbb{R}^{n \times n}$, we define its comparison matrix
$\langle A \rangle = (\alpha_{ij})$ by $\alpha_{ii} = |a_{ii}|$,
$\alpha_{ij} = -|a_{ij}|$, $i \ne j$. The matrix $A$ is said to be an
H-matrix if $\langle A \rangle$ is a nonsingular M-matrix. The splitting
$A = M - N$ is called a weak regular splitting if $M^{-1} \ge O$ and
$M^{-1}N \ge O$; the splitting is an H-compatible splitting if
$\langle A \rangle = \langle M \rangle - |N|$;
see e.g., Berman and Plemmons [3] or Varga [13].
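
As a small illustration of these definitions, the Python sketch below forms the comparison matrix of a dense matrix stored as a list of rows. It also includes a check for strict row diagonal dominance, which is a well-known sufficient (not necessary) condition for a matrix to be an H-matrix; the function names and the sample matrix are assumptions of this example.

```python
def comparison_matrix(A):
    """Return <A>: |a_ii| on the diagonal and -|a_ij| off the diagonal."""
    n = len(A)
    return [[abs(A[i][j]) if i == j else -abs(A[i][j]) for j in range(n)]
            for i in range(n)]

def strictly_diagonally_dominant(A):
    """Sufficient test only: strict row diagonal dominance implies H-matrix."""
    n = len(A)
    return all(abs(A[i][i]) > sum(abs(A[i][j]) for j in range(n) if j != i)
               for i in range(n))

A = [[2.0, -1.0], [3.0, -4.0]]
C = comparison_matrix(A)
dominant = strictly_diagonally_dominant(A)
```

A matrix that fails the dominance test may still be an H-matrix, so a negative answer from `strictly_diagonally_dominant` is inconclusive.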

A sequence $\{x^{(\ell)}\}$ converges Q-quadratically to $x^*$ if there exists
$c < 1$ such that

\[ \|x^{(\ell+1)} - x^*\| \le c \|x^{(\ell)} - x^*\|^2. \]

$L(\mathbb{R}^n)$ denotes the linear space of linear operators from
$\mathbb{R}^n$ to $\mathbb{R}^n$.

Lemma 1. Suppose that the mapping $A : D \subset \mathbb{R}^m \to L(\mathbb{R}^n)$
is continuous at a point $x^0 \in D$ for which $A(x^0)$ is nonsingular. Then
there is a $\delta > 0$ and a $\beta > 0$ so that, for any
$x \in D \cap \{x : \|x - x^0\| < \delta\}$, $A(x)$ is nonsingular and
$\|A(x)^{-1}\| \le \beta$. Moreover, $A(x)^{-1}$ is continuous in $x$ at $x^0$.

Proof. See Ortega and Rheinboldt [11].
Theorem 1. Suppose $F : D \subset \mathbb{R}^n \to \mathbb{R}^m$ is
G-differentiable at each point of a convex set $D_0 \subset D$. Then, for any
$x, y, z \in D_0$,

\[ \|F(y) - F(z) - F'(x)(y - z)\| \le \|y - z\| \sup_{0 \le t \le 1} \|F'(z + t(y - z)) - F'(x)\|. \]

Proof. See Ortega and Rheinboldt [11].

2  Convergence 

In this section we study the convergence of the iterative scheme (8). For this
purpose we need to make the following additional assumptions on the splittings
(4).

(iv) There exist $t_k > 0$, $1 \le k \le p$, such that for $x \in S_0$,
$\|M_k(x) - M_k(x^*)\| \le t_k \|x - x^*\|$.

(v) $M_k(x^*)$, $1 \le k \le p$, are nonsingular.

(vi) There exists $0 < \alpha < 1$ such that, for each positive integer $s$
and $\ell = 0, 1, \ldots$, $\|H_{\ell,s}(x^*)\| \le \alpha$,
where $H_{\ell,s}(x^*)$ is defined in (5).

From assumptions (i)-(iii) of Section 1 and using Lemma 1, it follows that
there exists $0 < r_1 < r_0$ such that $F'$ is continuous and nonsingular in
$S_1 = \{x \in \mathbb{R}^n : \|x - x^*\| < r_1\}$. On the other hand, it can
be shown (see e.g., Ortega and Rheinboldt [11]) that the Newton method (2)
converges Q-quadratically to $x^*$ in a neighborhood of $x^*$. In order to
simplify the notation we also denote this neighborhood by $S_1$. From
assumptions (iv)-(v) and Lemma 1, it follows that $M_k$, $1 \le k \le p$, is
continuous and nonsingular in a neighborhood of $x^*$, say again $S_1$.
Therefore $M_k(x)^{-1} N_k(x)$, $1 \le k \le p$, is well defined and moreover
continuous in $S_1$. Then, $H_{\ell,s}(x)$ is also continuous in $S_1$. Now,
from assumption (vi) we obtain that $\|H_{\ell,s}(x)\| \le \alpha$,
$\ell = 0, 1, \ldots$, $s = 1, 2, \ldots, m_\ell$, in a neighborhood of $x^*$,
denoted again by $S_1$. Moreover, since $M_k(x)^{-1} N_k(x)$,
$1 \le k \le p$, are continuous in $S_1$, there exists a positive integer $K$
such that

\[ \|M_k(x)^{-1} N_k(x)\| \le K, \quad 1 \le k \le p, \qquad (10) \]

for all $x$ in a neighborhood of $x^*$, that we denote again by $S_1$.

Lemma 2. Let $A : \mathbb{R}^n \to L(\mathbb{R}^n)$ be a mapping such that
$\|A(x)\| \le \theta$ in a neighborhood $S$ of $x^*$. Then for any $x \in S$
and for any positive integer $m$

\[ \|A(x)^m - A(x^*)^m\| \le m \theta^{m-1} \|A(x) - A(x^*)\|. \]

Proof. We proceed by induction. For $m = 1$, the result follows obviously.
Suppose that the result is true for $m = k$. Then

\[ \|A(x)^{k+1} - A(x^*)^{k+1}\| = \|A(x)^{k+1} - A(x)^k A(x^*) + A(x)^k A(x^*) - A(x^*)^{k+1}\| \]

\[ \le \|A(x)^k (A(x) - A(x^*))\| + \|(A(x)^k - A(x^*)^k) A(x^*)\| \]

\[ \le \theta^k \|A(x) - A(x^*)\| + k \theta^{k-1} \|A(x) - A(x^*)\| \theta = (k+1) \theta^k \|A(x) - A(x^*)\|, \]

and the proof is complete.
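
As a side remark (not part of the original proof), the same bound can be read off without induction from a telescoping identity. Writing $A = A(x)$ and $B = A(x^*)$ as shorthand introduced only for this remark, and using $\|A\| \le \theta$ and $\|B\| \le \theta$:

```latex
\[
A^m - B^m = \sum_{i=0}^{m-1} A^{\,m-1-i}\,(A - B)\,B^{\,i},
\]
\[
\|A^m - B^m\| \le \sum_{i=0}^{m-1} \theta^{\,m-1-i}\,\|A - B\|\,\theta^{\,i}
             = m\,\theta^{\,m-1}\,\|A - B\|.
\]
```

Each term of the sum equals $A^{m-i}B^{i} - A^{m-1-i}B^{i+1}$, so consecutive terms cancel and only $A^m - B^m$ remains.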
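Lemma 3. Let $x \in S_1$ and let $m$ be a positive integer, then

\[ I - \prod_{i=1}^{m} H_{\ell,i}(x) = A_{\ell,m}(x) F'(x), \quad \ell = 0, 1, \ldots. \]

Proof. Let $x \in S_1$, then from (7)

\[ B_{\ell,s}(x) F'(x) = I - H_{\ell,s}(x), \quad s = 1, 2, \ldots, m, \quad \ell = 0, 1, \ldots, \qquad (11) \]

where $H_{\ell,s}(x)$ and $B_{\ell,s}(x)$ are defined in (5) and (6)
respectively. Then from (9) and (11) we obtain

\[ A_{\ell,m}(x) F'(x) = \sum_{i=1}^{m-1} \prod_{j=i+1}^{m} H_{\ell,j}(x) \left( I - H_{\ell,i}(x) \right) + \left( I - H_{\ell,m}(x) \right) \]

\[ = \sum_{i=1}^{m-1} \left( \prod_{j=i+1}^{m} H_{\ell,j}(x) - \prod_{j=i}^{m} H_{\ell,j}(x) \right) + \left( I - H_{\ell,m}(x) \right) \]

\[ = I - \prod_{i=1}^{m} H_{\ell,i}(x), \]

and the proof is done.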

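Lemma 4. Let $x \in S_1$ and let $m$ be a positive integer, then

\[ \left\| \prod_{i=1}^{m} H_{\ell,i}(x) - \prod_{i=1}^{m} H_{\ell,i}(x^*) \right\| \le \alpha^{m-1} \sum_{i=1}^{m} \|H_{\ell,i}(x) - H_{\ell,i}(x^*)\|, \quad \ell = 0, 1, \ldots. \]

Proof. In order to show this result, we proceed by induction. Obviously, the
result follows for $m = 1$. Suppose that the result is true for $m = k$. Then,
taking into account that $\|H_{\ell,s}(x)\| \le \alpha$, we can write

\[ \left\| \prod_{i=1}^{k+1} H_{\ell,i}(x) - \prod_{i=1}^{k+1} H_{\ell,i}(x^*) \right\| = \left\| \prod_{i=1}^{k+1} H_{\ell,i}(x) - H_{\ell,k+1}(x) \prod_{i=1}^{k} H_{\ell,i}(x^*) + H_{\ell,k+1}(x) \prod_{i=1}^{k} H_{\ell,i}(x^*) - \prod_{i=1}^{k+1} H_{\ell,i}(x^*) \right\| \]

\[ \le \|H_{\ell,k+1}(x)\| \left\| \prod_{i=1}^{k} H_{\ell,i}(x) - \prod_{i=1}^{k} H_{\ell,i}(x^*) \right\| + \|H_{\ell,k+1}(x) - H_{\ell,k+1}(x^*)\| \left\| \prod_{i=1}^{k} H_{\ell,i}(x^*) \right\| \]

\[ \le \alpha \left\| \prod_{i=1}^{k} H_{\ell,i}(x) - \prod_{i=1}^{k} H_{\ell,i}(x^*) \right\| + \|H_{\ell,k+1}(x) - H_{\ell,k+1}(x^*)\| \alpha^{k} \]

\[ \le \alpha^{k} \sum_{i=1}^{k} \|H_{\ell,i}(x) - H_{\ell,i}(x^*)\| + \alpha^{k} \|H_{\ell,k+1}(x) - H_{\ell,k+1}(x^*)\| \]

\[ = \alpha^{k} \sum_{i=1}^{k+1} \|H_{\ell,i}(x) - H_{\ell,i}(x^*)\|, \]

and the proof is complete.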

Lemma 5. Suppose assumptions (i)-(v) are satisfied. Assume further that the
sequence of number of local iterations $q(\ell, s, k)$, $\ell = 0, 1, \ldots$,
$s = 1, 2, \ldots, m_\ell$, $1 \le k \le p$, remains bounded by $q > 0$. Then,
there exists $L^* > 0$ such that, for any $x \in S_1$ and for any positive
integer $s$, it follows

\[ \|H_{\ell,s}(x) - H_{\ell,s}(x^*)\| \le L^* \|x - x^*\|, \quad \ell = 0, 1, \ldots. \]

Proof. Let $x \in S_1$. From (iii), (iv) and (v) it is known (see e.g., [12])
that there exists $\tau_k > 0$, $1 \le k \le p$, such that

\[ \|M_k^{-1}(x) N_k(x) - M_k^{-1}(x^*) N_k(x^*)\| \le \tau_k \|x - x^*\|. \qquad (12) \]

On the other hand, if we denote $R_k(x) = M_k^{-1}(x) N_k(x)$, using Lemma 2
and (10), we obtain

\[ \|R_k(x)^{q(\ell,s,k)} - R_k(x^*)^{q(\ell,s,k)}\| \le q(\ell,s,k) K^{q(\ell,s,k)-1} \|R_k(x) - R_k(x^*)\|. \qquad (13) \]

Therefore from (10), (12) and (13), we have

\[ \|H_{\ell,s}(x) - H_{\ell,s}(x^*)\| \le \sum_{k=1}^{p} \|E_k\| \|R_k(x)^{q(\ell,s,k)} - R_k(x^*)^{q(\ell,s,k)}\| \]

\[ \le \sum_{k=1}^{p} \|E_k\| q(\ell,s,k) K^{q(\ell,s,k)-1} \tau_k \|x - x^*\| \le \sum_{k=1}^{p} \|E_k\| \left( q K'^{\,q-1} \tau_k \right) \|x - x^*\|, \qquad (14) \]

with $K' = \max\{1, K\}$. Then $\|H_{\ell,s}(x) - H_{\ell,s}(x^*)\| \le L^* \|x - x^*\|$,
with $L^* = \sum_{k=1}^{p} \|E_k\| \left( q K'^{\,q-1} \tau_k \right)$.

Lemma 6. Let assumptions (i)-(vi) hold and suppose that the sequence of
number of local iterations $q(\ell, s, k)$, $\ell = 0, 1, \ldots$,
$s = 1, 2, \ldots, m_\ell$, $1 \le k \le p$, remains bounded by $q > 0$. Then
there exists $c_1 < +\infty$ such that, for any $x \in S_1$,

\[ \|G_{\ell,m}(x) - x^*\| \le c_1 \|x - x^*\|^2 + \alpha^m \|x - x^*\|, \quad \ell = 0, 1, \ldots. \]

Proof. From Lemma 3 it follows

\[ \|G_{\ell,m}(x) - x^*\| = \|x - A_{\ell,m}(x) F(x) - x^*\| \]

\[ \le \left\| -A_{\ell,m}(x) F(x) + \left( I - \prod_{j=1}^{m} H_{\ell,j}(x^*) \right)(x - x^*) \right\| + \left\| \prod_{j=1}^{m} H_{\ell,j}(x^*)(x - x^*) \right\| \]

\[ = \left\| -A_{\ell,m}(x) F(x) + A_{\ell,m}(x^*) F'(x^*)(x - x^*) \right\| + \left\| \prod_{j=1}^{m} H_{\ell,j}(x^*)(x - x^*) \right\|. \]
Then by assumption (vi), for $\ell = 0, 1, \ldots$, we obtain

\[ \|G_{\ell,m}(x) - x^*\| \le \|-A_{\ell,m}(x) F(x) + A_{\ell,m}(x^*) F'(x^*)(x - x^*)\| + \alpha^m \|x - x^*\|. \]

Now, since $F(x^*) = 0$, we have the following inequalities

\[ \|G_{\ell,m}(x) - x^*\| \le \|-A_{\ell,m}(x) \left( F(x) - F(x^*) - F'(x^*)(x - x^*) \right)\| \]
\[ \quad + \|A_{\ell,m}(x) \left( F'(x) - F'(x^*) \right)(x - x^*)\| \]
\[ \quad + \|\left( A_{\ell,m}(x) F'(x) - A_{\ell,m}(x^*) F'(x^*) \right)(x - x^*)\| \]
\[ \quad + \alpha^m \|x - x^*\|. \qquad (15) \]

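On the other hand from (9), and using assumption (vi) we have

\[ \|A_{\ell,m}(x)\| = \left\| \sum_{i=1}^{m-1} \prod_{j=i+1}^{m} H_{\ell,j}(x) B_{\ell,i}(x) + B_{\ell,m}(x) \right\| \]

\[ \le \sum_{i=1}^{m-1} \left\| \prod_{j=i+1}^{m} H_{\ell,j}(x) \right\| \|B_{\ell,i}(x)\| + \|B_{\ell,m}(x)\| \]

\[ \le \sum_{i=1}^{m-1} \alpha^{m-i} \|B_{\ell,i}(x)\| + \|B_{\ell,m}(x)\|. \qquad (16) \]

By the definition of $B_{\ell,s}(x)$, given by (6), and using (10), we obtain

\[ \|B_{\ell,s}(x)\| \le \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{q(\ell,s,k)-1} \left\| \left( M_k^{-1}(x) N_k(x) \right)^{h} \right\| \|M_k^{-1}(x)\| \le \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{q(\ell,s,k)-1} K^{h} \|M_k^{-1}(x)\|. \]

That is, since the sequence $q(\ell, s, k)$, $\ell = 0, 1, \ldots$,
$s = 1, 2, \ldots, m_\ell$, $1 \le k \le p$, remains bounded by $q > 0$, we have

\[ \|B_{\ell,s}(x)\| \le \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{q-1} K^{h} \|M_k^{-1}(x)\|. \qquad (17) \]

Let $\beta = \max\{\beta_1, \beta_2, \ldots, \beta_p\}$, where
$\beta_k = \sup\{\|M_k(x)^{-1}\| : x \in S_1\}$. The existence of $\beta_k$,
$1 \le k \le p$, follows from Lemma 1. Then, from (17) we obtain

\[ \|B_{\ell,s}(x)\| \le \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{q-1} K^{h} \beta. \]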

Thus,

\[ \|B_{\ell,s}(x)\| \le K^*, \quad \ell = 0, 1, \ldots, \quad s = 1, 2, \ldots, m_\ell, \qquad (18) \]

for all $x \in S_1$, where $K^* = \sum_{k=1}^{p} \|E_k\| \sum_{h=0}^{q-1} K^{h} \beta > 0$.

Now, by (16) and (18), for any $x \in S_1$ and for any positive integer $m$,
we have

\[ \|A_{\ell,m}(x)\| \le \sum_{i=1}^{m-1} \alpha^{m-i} K^* + K^* \le \left( \sum_{l=0}^{\infty} \alpha^{l} + 1 \right) K^* = \bar K^*. \qquad (19) \]

Now, using (19) we bound (15). From Theorem 1, it follows

\[ \|-A_{\ell,m}(x) \left( F(x) - F(x^*) - F'(x^*)(x - x^*) \right)\| \]
\[ \le \|-A_{\ell,m}(x)\| \, \|x - x^*\| \sup_{0 \le t \le 1} \|F'(x^* + t(x - x^*)) - F'(x^*)\| \]
\[ \le \|-A_{\ell,m}(x)\| \, \|x - x^*\| \sup_{0 \le t \le 1} L \|t(x - x^*)\| \le \bar K^* L \|x - x^*\|^2. \]

On the other hand, by condition (iii) we have

\[ \|A_{\ell,m}(x) \left( F'(x) - F'(x^*) \right)(x - x^*)\| \le \|A_{\ell,m}(x)\| \, \|F'(x) - F'(x^*)\| \, \|x - x^*\| \le \bar K^* L \|x - x^*\|^2. \]

Using Lemmata 3, 4 and 5 we obtain

\[ \|\left( A_{\ell,m}(x) F'(x) - A_{\ell,m}(x^*) F'(x^*) \right)(x - x^*)\| = \left\| \left( \prod_{j=1}^{m} H_{\ell,j}(x) - \prod_{j=1}^{m} H_{\ell,j}(x^*) \right)(x - x^*) \right\| \]

\[ \le \alpha^{m-1} \sum_{j=1}^{m} \|H_{\ell,j}(x) - H_{\ell,j}(x^*)\| \, \|x - x^*\| \le m \alpha^{m-1} L^* \|x - x^*\|^2. \]

Since $\alpha < 1$, $\{m \alpha^{m-1}\}$ is upper bounded. Let $c_2$ (dependent
on $\alpha$) be an upper bound of this set; then setting
$c_1 = 2 \bar K^* L + c_2 L^*$, the proof is complete.

Remark 1. We want to point out that since we know nothing about the bound
$K$ in (10), we need, in Lemmata 5 and 6, the sequence $q(\ell, s, k)$ to be
bounded by $q > 0$. If we have $K < 1$, then we do not need that upper bound
for the non-stationary parameters $q(\ell, s, k)$ (see (14) and (17)).

Theorem 2. Let assumptions (i)-(vi) hold and $F(x^*) = 0$. Let
$\{m_\ell\}_{\ell=0}^{\infty}$ be a sequence of positive integers, and define

\[ m = \max \left\{ \{m_0\} \cup \left\{ m_\ell - \sum_{i=0}^{\ell-1} m_i : \ell = 1, 2, \ldots \right\} \right\}. \qquad (20) \]

Suppose that $m < +\infty$ and that the sequence of non-stationary parameters
$q(\ell, s, k)$, $\ell = 0, 1, \ldots$, $s = 1, 2, \ldots, m_\ell$,
$1 \le k \le p$, is bounded by $q > 0$. Then,
there exist $r > 0$ and $c < 1$ such that, for
$x^{(0)} \in S = \{x \in \mathbb{R}^n : \|x - x^*\| < r\}$,
the sequence of iterates defined by (8) converges to $x^*$ and satisfies

\[ \|x^{(\ell+1)} - x^*\| \le c^{m_\ell} \|x^{(\ell)} - x^*\|. \]

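Proof. Let $c_1$ be as in Lemma 6. Let $c > 0$ be such that
$\alpha^{1/m} < c < 1$. Since $\alpha < c^m$, there exists $0 < r < r_1$ such
that

\[ c_1 r + \alpha \le c^m, \]

and then,

\[ (c_1 r + \alpha)^{1/m} \le c < 1. \]

Now, we proceed by induction. For $\ell = 1$, using Lemma 6, we have

\[ \|x^{(1)} - x^*\| = \|G_{0,m_0}(x^{(0)}) - x^*\| \le c_1 \|x^{(0)} - x^*\|^2 + \alpha^{m_0} \|x^{(0)} - x^*\| \le (c_1 r + \alpha^{m_0}) \|x^{(0)} - x^*\|. \]

Since $c_1 r + \alpha^{m_0} \le c_1 r + \alpha \le c^m \le c^{m_0}$, then
$\|x^{(1)} - x^*\| \le c^{m_0} \|x^{(0)} - x^*\|$.
Therefore the result follows for $\ell = 1$. Suppose that the result is true
for $0 < \ell \le j$. Then

\[ \|x^{(j)} - x^*\| \le c^{m_{j-1}} \|x^{(j-1)} - x^*\| \le \prod_{s=0}^{j-1} c^{m_s} \|x^{(0)} - x^*\|. \]

Now, for $\ell = j + 1$, from Lemma 6 it follows

\[ \|x^{(j+1)} - x^*\| = \|G_{j,m_j}(x^{(j)}) - x^*\| \le \left( c_1 \|x^{(j)} - x^*\| + \alpha^{m_j} \right) \|x^{(j)} - x^*\| \]

\[ \le \left( c_1 \left( \prod_{s=0}^{j-1} c^{m_s} \right) \|x^{(0)} - x^*\| + \alpha^{m_j} \right) \|x^{(j)} - x^*\| \]

\[ \le \left( c_1 r \left( \prod_{s=0}^{j-1} c^{m_s} \right) + \alpha^{m_j} \right) \|x^{(j)} - x^*\| \le \left( c_1 r c^{m_j - m} + \alpha^{m_j} \right) \|x^{(j)} - x^*\| \]

\[ \le \left( (c^m - \alpha) c^{m_j - m} + \alpha^{m_j} \right) \|x^{(j)} - x^*\| \]

\[ = c^{m_j} \left( (c^m - \alpha) c^{-m} + \alpha^{m_j} c^{-m_j} \right) \|x^{(j)} - x^*\| \]

\[ = c^{m_j} \left( 1 - \alpha c^{-m} + \alpha^{m_j} c^{-m_j} \right) \|x^{(j)} - x^*\| \]

\[ = c^{m_j} \left( 1 + \alpha c^{-m} \left( \alpha^{m_j - 1} c^{m - m_j} - 1 \right) \right) \|x^{(j)} - x^*\|. \]

Since $0 < \alpha < c^m < 1$, then $\alpha c^{-m} < 1$. On the other hand,
$0 < \alpha^{m_j - 1} c^{m - m_j} \le c^{m(m_j - 1)} c^{m - m_j} \le 1$ and
then, $-1 < \alpha c^{-m} \left( \alpha^{m_j - 1} c^{m - m_j} - 1 \right) \le 0$.

Therefore, $\|x^{(j+1)} - x^*\| \le c^{m_j} \|x^{(j)} - x^*\|$, and the proof
is complete.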
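
The quantity $m$ in (20) is easy to evaluate for the first terms of a given sequence $\{m_\ell\}$. The Python sketch below does this for a finite prefix only, so it gives a lower estimate of the maximum over the whole sequence; the function name `bound_m` and the sample sequences are assumptions of this example.

```python
def bound_m(ms):
    """Evaluate (20) on a finite prefix ms = [m_0, m_1, ...] of positive integers."""
    best = ms[0]
    partial = 0  # running sum m_0 + ... + m_{l-1}
    for l in range(1, len(ms)):
        partial += ms[l - 1]
        best = max(best, ms[l] - partial)
    return best

m_constant = bound_m([3, 3, 3, 3, 3])             # constant number of inner steps
m_doubling = bound_m([2 ** l for l in range(8)])  # m_l = 2^l
```

For the doubling sequence each term exceeds the sum of all previous terms by exactly one, so the prefix value stays at 1 however many terms are included, while a constant sequence gives its own value.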

Theorem 3. Let assumptions (i)-(iv) hold and $F(x^*) = 0$. Let
$\{m_\ell\}_{\ell=0}^{\infty}$ be a sequence of positive integers, and define
$m$ as in (20). Suppose that $m < +\infty$.
If any of the following two conditions is satisfied

1. $F'(x^*)$ is a monotone matrix and $F'(x^*) = M_k(x^*) - N_k(x^*)$,
$1 \le k \le p$, are weak regular splittings,

2. $F'(x^*)$ is an H-matrix and $F'(x^*) = M_k(x^*) - N_k(x^*)$,
$1 \le k \le p$, are H-compatible splittings,

then there exist $r > 0$ and $c < 1$ such that, for
$x^{(0)} \in S = \{x \in \mathbb{R}^n : \|x - x^*\| < r\}$, the sequence of
iterates defined by (8) converges to $x^*$ and satisfies

\[ \|x^{(\ell+1)} - x^*\| \le c^{m_\ell} \|x^{(\ell)} - x^*\|. \]

Proof. Under conditions 1 and 2, and taking into account, respectively, the
proofs of Theorem 2.1 of [4] and Theorem 3.1 of [9], we obtain that
assumptions (v), (vi) and (10) with $K < 1$ are satisfied. Then, the proofs
follow from Theorem 2 and Remark 1.

3  Numerical  Experiments 

We  have  implemented  the  above  method  on  two  distributed  multiprocessors. 
The  first  platform  is  an  IBM  RS/6000  SP  with  8  nodes.  The  second  platform  is 
an  Ethernet  network  of  five  120  MHz  Pentiums.  In  order  to  manage  the  parallel 
environment  we  have  used  the  PVMe  library  of  parallel  routines  for  the  IBM 
RS/6000  SP  and  the  PVM  library  for  the  cluster  of  Pentiums  [7],  [8]. 

In order to illustrate the behavior of the above algorithms, we have considered
the following semilinear elliptic partial differential equation (see e.g., [5],
[12], [14])

\[
\begin{array}{ll}
-(K_1 u_x)_x - (K_2 u_y)_y = -g e^{u}, & (x, y) \in \Omega, \\
u = x^2 + y^2, & (x, y) \in \partial\Omega,
\end{array}
\qquad (21)
\]

where

\[ K_1 = K_1(x, y) = 1 + x^2 + y^2, \]
\[ K_2 = K_2(x, y) = 1 + e^{x} + e^{y}, \]
\[ g = g(x, y) = 2 \left( 2 + 3x^2 + y^2 + e^{x} + (1 + y) e^{y} \right) e^{-x^2 - y^2}, \]
\[ \Omega = (0, 1) \times (0, 1). \]

It is well known that this problem has the unique solution
$u(x, y) = x^2 + y^2$. To solve equation (21) using the finite difference
method, we consider a grid in $\Omega$ of $d^2$ nodes equally spaced by
$h = \Delta x = \Delta y = \frac{1}{d+1}$. This discretization yields a
nonlinear system of the form $Ax + \Phi(x) = b$, where
$\Phi : \mathbb{R}^n \to \mathbb{R}^n$ is a nonlinear diagonal mapping and $A$
is a block tridiagonal symmetric matrix $A = (D_{i-1}, T_i, D_i)_{i=1}^{d}$,
where $T_i$ are tridiagonal matrices of size $d \times d$,
$i = 1, 2, \ldots, d$, and $D_i$ are $d \times d$ diagonal matrices,
$i = 1, \ldots, d - 1$; see e.g., [5]. Let $S = \{1, 2, \ldots, n\}$ and let
$S_k$, $k = 1, 2, \ldots, p$, be subsets of $S$ such that
$S = \bigcup_{k=1}^{p} S_k$.
Let us further consider a multisplitting of $F'(x)$, where
$F(x) = Ax + \Phi(x) - b$, of the form

\[
\{D(x) - L_k, U_k, E_k\}_{k=1}^{p}, \quad \text{where } L_k = (l_{ij}^{(k)}), \quad
l_{ij}^{(k)} = \left\{ \begin{array}{ll} -a_{ij}, & j < i \text{ and } i, j \in S_k, \\ 0, & \text{otherwise}, \end{array} \right.
\]
with

\[ S_k = \left\{ 1 + \sum_{j<k} n_j, \ldots, \sum_{i=1}^{k} n_i \right\}, \quad 1 \le k \le p, \quad \sum_{k=1}^{p} n_k = n, \]

\[ D(x) = \mathrm{diag}(A) + \mathrm{diag}(\Phi_1'(x_1), \ldots, \Phi_n'(x_n)), \]

and the $n \times n$ nonnegative diagonal matrices $E_k$, $1 \le k \le p$, are
defined such that their $i$th diagonal entry is null if $i \notin S_k$. Note
that this multisplitting is a Gauss-Seidel type multisplitting. The stopping
criterion used was $\|x^{(\ell)} - v\|_2 \le h^2$, where $\| \cdot \|_2$ is the
Euclidean norm and $v$ is the vector whose entries are the values of the exact
solution of (21) on the nodes $(ih, jh)$, $i, j = 1, \ldots, d$, and the
initial vector was $x^{(0)} = (1, \ldots, 1)^T$. All times are reported in
seconds.
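
As a sanity check on the test problem, the Python sketch below verifies numerically that $u = x^2 + y^2$ satisfies equation (21) at the interior grid nodes, using derivatives worked out by hand for this particular $u$. The mesh size $h = 1/(d+1)$ and the value $d = 64$ are assumptions of this sketch, as are all function names; it does not assemble or solve the discretized system.

```python
import math

def g(x, y):
    """Source coefficient g(x, y) of the model problem."""
    return (2.0 * (2.0 + 3.0 * x * x + y * y + math.exp(x) + (1.0 + y) * math.exp(y))
            * math.exp(-x * x - y * y))

def lhs(x, y):
    """-(K1 u_x)_x - (K2 u_y)_y for u = x^2 + y^2, differentiated by hand."""
    d_x = 2.0 * (1.0 + x * x + y * y) + 4.0 * x * x                       # (K1 * 2x)_x
    d_y = 2.0 * (1.0 + math.exp(x) + math.exp(y)) + 2.0 * y * math.exp(y)  # (K2 * 2y)_y
    return -(d_x + d_y)

def rhs(x, y):
    """-g e^u evaluated at u = x^2 + y^2."""
    return -g(x, y) * math.exp(x * x + y * y)

d = 64                 # grid nodes per direction (assumed)
h = 1.0 / (d + 1)      # assumed uniform mesh size
n = d * d              # size of the resulting nonlinear system
max_gap = max(abs(lhs(i * h, j * h) - rhs(i * h, j * h))
              for i in range(1, d + 1) for j in range(1, d + 1))
```

The gap is at rounding level, which confirms that the stated $g$ is consistent with the exact solution used in the stopping criterion.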

We have run our codes with matrices of various sizes and different
multisplittings depending on the number of processors used ($p$) and the choice of
the values $n_k$, $1 \le k \le p$, but to focus our discussion, we present here results
obtained with $d = 64$, which gives rise to a nonlinear system of size 4096. The
conclusions we present here can be considered as representative of the larger set of
experiments performed.


Time (in seconds):

q=1   27.59  15.91  4.73  7.73
q=4   1 8.08  5.11  2.35  4.11
q=9   4.93  3.43  2.27  1.86


Fig.  1.  Non-stationary  parallel  Newton  Gauss-Seidel  methods 


Figure 1 shows the behavior of some non-stationary parallel Newton iterative
methods on an IBM RS/6000 SP multiprocessor using four processors and $n_k = 1024$,
$1 \le k \le 4$. This figure illustrates the influence of the non-stationary
parameters $q(k) = q$, $1 \le k \le 4$, in relation to $m_\ell = 1, 2, \ell, 2\ell$. Note that,
for a fixed number of processors, the computational time decreases as the
non-stationary parameter increases, up to some optimal value of $q$
($q = 9$ in Figure 1), after which the time starts to increase. This behavior is typical of
non-stationary methods; see e.g. [6] and [9]. In general, this optimal value is hard
to predict, but execution time drops whenever the reduction in the number of
iterations outweighs the cost of performing more local updates. This situation is independent
of the choice of $m_\ell$. On the other hand, this figure also shows
that the best non-stationary parallel methods were obtained by setting $m_\ell = \ell$ and
$m_\ell = 2\ell$.


Time (in seconds):

        Cluster ($m_\ell = \ell$)   IBM SP ($m_\ell = \ell$)
q=1     28.33                       5.30
q=2     18.99                       3.93
q=10    14.01                       3.01
q=16    14.54                       3.16


Fig.  2.  Cluster  of  Pentiums  versus  IBM  RS/6000  SP  (2  processors) 


Figure 2 shows the behavior of some non-stationary parallel Newton iterative
methods in relation to the parallel computer system used. In this figure we have
used two processors, $n_k = 2048$, $k = 1, 2$, and $m_\ell = \ell$. The conclusions were
similar on both multiprocessors; however, the computing platform obviously has
an influence on the performance of a parallel implementation. Note that when
$q = 1$, the method reduces to the well-known parallel Newton Gauss-Seidel
method (see [14]) and, as can be seen, this method is always worse
than the non-stationary parallel methods. Moreover, we have compared these
methods with the algorithms presented in [1]. We have observed that the methods
discussed here behave better than those algorithms. For example, for the matrix
of size 4096, the best time we have obtained with the IBM RS/6000 SP using
four processors (see Figure 1) is 1.86 seconds, whereas the best times obtained
with the other methods (see Tables 1 and 2 of [1]) were about 6 seconds.

On the other hand, in Figure 3 we have compared the algorithms of this paper,
setting $q = 9$, with the well-known sequential Newton Gauss-Seidel method
[11] versus the number of processors in the IBM RS/6000 SP. The best CPU
time of this sequential method was obtained with $m_\ell = 1$. So, if we
calculate the speed-up taking this sequential method as the reference algorithm
(speed-up = CPU time of sequential Newton Gauss-Seidel algorithm / REAL time of
parallel algorithm), we obtain an efficiency (speed-up / number of processors)
of about 90% with two processors
and about 60% with four processors. Similar efficiencies were obtained for the
cluster of Pentiums. However, the same does not hold for the parallel Newton
Gauss-Seidel method ([14]). That is, if $q = 1$, we have obtained efficiencies
of only about 0 - 30% in both multiprocessors.

Fig. 3. Non-stationary methods (q = 9) and sequential Newton Gauss-Seidel method
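
The speed-up and efficiency definitions used in this comparison can be written down directly. The sketch below is only an illustration: the function names are hypothetical, and the timings in the usage note are made-up values chosen to show the arithmetic, not measurements from the experiments.

```python
def speed_up(sequential_cpu_time, parallel_real_time):
    """Speed-up relative to the sequential reference algorithm."""
    return sequential_cpu_time / parallel_real_time

def efficiency(sequential_cpu_time, parallel_real_time, processors):
    """Efficiency = speed-up divided by the number of processors."""
    return speed_up(sequential_cpu_time, parallel_real_time) / processors
```

For example, a hypothetical sequential time of 3.6 s against a parallel time of 2.0 s on two processors gives a speed-up of 1.8 and an efficiency of 0.9.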


[Plot omitted: curves for Newton-SOR, q = 2 and q = 3; horizontal axis from 0.9 to 1.9]

Fig. 4. Non-stationary Newton-SOR methods


Finally, Figure 4 illustrates the influence of the relaxation parameter $\omega$ when
non-stationary parallel Newton-SOR methods are used. In this figure we have
considered some non-stationary parallel Newton-SOR methods using four processors,
$n_k = 1024$, $1 \le k \le 4$, and a single choice of $m_\ell$, and for each one we recorded the REAL
time in seconds on the IBM RS/6000 SP. Moreover, these results were compared
to the corresponding parallel Newton-SOR method ([14]). As can be seen,
the conclusions were similar to those described throughout this section.

Abstract. In this work the fan-in and fan-out algorithms for Cholesky
factorization of sparse matrices on distributed memory systems are adapted
for modified Cholesky factorization and improved to reduce idle
times. The behavior of the new algorithms has been evaluated on two
machines with significantly different ratios between processor speed and
communications speed, the Fujitsu AP1000 and the Cray T3E.


Keywords:  modified  Cholesky  factorization,  sparse  matrices,  distributed 
systems 

1  Introduction 

The modified Cholesky factorization of a symmetric matrix $A \in \mathbb{R}^{n \times n}$ (not
necessarily positive definite) is a Cholesky factorization of $A' = A + E = LDL^T$,
where $E$ is a non-negative diagonal matrix such that $A'$ is positive definite [6].
This technique is appropriate when the modification of a linear system is
justified, as in Newton methods used in nonlinear optimization problems.

The standard Cholesky factorization may be computed in parallel using several
methods depending on the access and updating order of the matrices. The
main proposals may be classified into three basic types: the fan-out [5], fan-in [1]
and multifrontal [7] methods. We present the fan-in and fan-out algorithms for
the modified Cholesky factorization on distributed memory systems, together
with modified versions reducing processor idle time. Fan-in and fan-out versions
for the modified Cholesky factorization on NUMA shared memory systems can
be found in [10] and [11].

Recently great efforts have been made to find efficient block-oriented
implementations of sparse codes on distributed memory systems, examples
being the block fan-out method proposed by Rothberg [13], and the block fan-in
method proposed by Dumitrescu et al. [3]. Compared to the column-oriented
approaches, block-oriented distributed-memory sparse Cholesky factorization
benefits from a reduction in interprocessor communication volume. Unfortunately,
block-oriented approaches suffer from poor balance of the computational load
and are not suitable for all matrices; for example, they cannot be efficiently
applied to random matrices. In any case, the communication pattern generated
by a block algorithm is the same as that of its column counterpart, and therefore the
strategies presented here can be generalized to the block-oriented approaches.

This paper is organized as follows. In Sect. 2 the sequential modified Cholesky
factorization algorithm is introduced. In Sects. 3 and 4 the well-known fan-out and
fan-in algorithms for parallel modified Cholesky factorization of sparse matrices
are briefly reviewed and the modifications that reduce idle times are presented.
In Sect. 5 the results of trials carried out with a number of sparse matrices
on two distributed memory machines with significantly different ratios between
processor speed and communications speed, the Fujitsu AP1000 and the Cray
T3E, are presented and discussed; and in Sect. 6 our conclusions are summarized.

2  Modified  Cholesky  Factorization:  Sequential  Algorithm 

Because of the way in which it imposes positive definiteness by addition of a
diagonal matrix $E$, generalized Cholesky factorization is most conveniently treated
as a modification of the factorization $A = LDL^T$, where $D$ is a diagonal matrix
and $L$ is a lower triangular matrix with ones on the diagonal, rather than as a
modification of the standard Cholesky factorization $A = LL^T$ (where there is
no constraint on the diagonal of $L$). $E$ is not calculated separately and added
to $A$ before factorization of an explicit matrix $A' = A + E$; instead, it is added
implicitly by computing $D$ and $L$ directly from $A$ in such a way that $LDL^T$ is
positive definite and the factors are all bounded. This is achieved by ensuring
that the elements of $D$ and $L$ satisfy the conditions

$$d_k \ge \delta \qquad (1)$$

$$|l_{ik}\sqrt{d_k}| \le \beta, \quad i > k \qquad (2)$$

where $\delta$ is a small positive quantity and $\beta$ is calculated from the largest absolute
values of the diagonal and off-diagonal elements of $A$ in such a way as to minimize
an upper bound on $\|E\|_\infty$ while ensuring that $E = 0$ if $A$ is "sufficiently" positive
definite [6]. In the usual in-place algorithm, $L$ is built up row by row. Sparse
matrices are stored using the CSS format, that is, a column-wise storage [12]. A
sequential algorithm to which sparse matrix techniques are more easily applied,
and which is more closely related to the fan-in and fan-out parallel algorithms
for Cholesky factorization of sparse matrices, is the following, in which $L$ is built
up column by column:

for j = 1 to n do
    calculate θ_j = max{ |a_sj| : s ∈ {j+1, ..., n} }                    (3)
    compute d_j = max{ δ, |a_jj|, θ_j^2 / β^2 }                         (4)
    update the diagonal of A for s > j:
        a_ss = a_ss - a_sj^2 / d_j,   s = j+1, ..., n                    (5)
    update off-diagonal elements of A while computing column j of L:
    for s = j+1 to n
        l_sj = a_sj / d_j                                                (6)
        for k = s+1 to n
            a_ks = a_ks - l_sj a_kj                                      (7)
        endfor
    endfor
endfor

Implementations of modified Cholesky factorization that are intended for the
factorization of sparse matrices stored by columns naturally make use of the fact
that column $s$ only needs to be updated by column $j$ if $l_{sj} \neq 0$ and there exists
some non-zero $a_{kj}$ ($k > s$). The elimination tree of the matrix provides precise
information about dependences among columns [9]. Such implementations will
also precede the code shown above with a stage in which, to reduce the fill-in of
$L$, $A$ is reordered (for example, by means of the widely used minimum degree [4]
scheme), and by a symbolic factorization stage that determines the pattern of $L$
for the purposes of memory assignment. However, in this paper we concentrate
on the actual numerical calculations, which are the most time-consuming.
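
The column-by-column algorithm of Eqs. (3) to (7) can be sketched for a small dense matrix as follows. This is a minimal illustration, not the authors' sparse implementation: the function name `modified_ldlt` is hypothetical, the matrix is a plain list of lists, and `delta` and `beta` are passed in by the caller rather than derived from the entries of A as described in [6].

```python
def modified_ldlt(a, delta, beta):
    """Dense sketch of the column-oriented modified LDL^T factorization.

    Only the lower triangle of the symmetric matrix `a` is read.
    Returns (d, l): the diagonal of D and the unit lower triangular L.
    """
    n = len(a)
    a = [row[:] for row in a]                      # work on a copy
    d = [0.0] * n
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        # Eq. (3): largest off-diagonal magnitude below the diagonal in column j
        theta = max((abs(a[s][j]) for s in range(j + 1, n)), default=0.0)
        # Eq. (4): choose d_j so that conditions (1) and (2) hold
        d[j] = max(delta, abs(a[j][j]), theta * theta / (beta * beta))
        # Eq. (5): update the remaining diagonal entries
        for s in range(j + 1, n):
            a[s][s] -= a[s][j] ** 2 / d[j]
        for s in range(j + 1, n):
            l[s][j] = a[s][j] / d[j]               # Eq. (6)
            for k in range(s + 1, n):
                a[k][s] -= l[s][j] * a[k][j]       # Eq. (7)
    return d, l
```

For a matrix that is already sufficiently positive definite and a large enough `beta`, the result coincides with the ordinary LDL^T factors; for an indefinite matrix the returned `d` is still strictly positive, which corresponds to factoring A + E for some non-negative diagonal E.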


3  Fan-out  Methods 

In fan-out methods, computation is data-driven: as a processor receives the data
it needs, it progressively computes the diagonal element ($d_j$) and $L_{*j}$ (the $j$-th
column of $L$), and as soon as it has completed this task it sends the diagonal
and the column to all the processors that require this column to perform
modifications. The operations needed to update the columns $s$ ($s > j$) that depend on $j$
are carried out on the receiving processors. The algorithm, in a simplified way, is
shown in Fig. 1, where mycols(P) is the set of indices of the columns for which
processor P is responsible, ncol(P) is initially the cardinality of mycols(P),
nmod(s) is initially the number of columns j < s that really need to be used in
computing column s, and users(j) is the set of indices of the columns that need
column j for their computation; nmod(j) and users(j) can be calculated, before
execution of the numerical factorization algorithm, by using the elimination tree
of A.
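
To make the roles of users(j) and nmod(s) concrete, the sketch below derives both from a given non-zero pattern of L. It is an illustration under a simplifying assumption: column s is taken to need column j whenever $l_{sj} \neq 0$, and the pattern is supplied directly instead of being obtained from the elimination tree. The function name and the dictionary layout are hypothetical.

```python
def dependency_sets(l_pattern, n):
    """Build users(j) and nmod(s) from the strictly lower pattern of L.

    l_pattern[j] is the set of row indices s > j with l_sj != 0 (0-based);
    columns without sub-diagonal non-zeros may be omitted.
    """
    users = {j: sorted(l_pattern.get(j, ())) for j in range(n)}
    nmod = {s: 0 for s in range(n)}
    for j in range(n):
        for s in users[j]:
            nmod[s] += 1        # column j must modify column s
    return users, nmod
```

Columns with `nmod[s] == 0` are exactly those that can be computed immediately in the first loop of the fan-out method.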

We propose a modification to this algorithm, the fan-out method with
pre-multiplication, in which the computations on each column are performed on the
sending processor. If this is done, the number of interprocessor communications
is greater than in FO, but the overlap between calculations and communications
is increased, and processor idle time is reduced. We propose the algorithm in
Fig. 2, where the vector $\langle js \rangle$ is defined by $\langle js \rangle = \{l_{kj} l_{sj} d_j\}_{k \in \{s+1, \ldots, n\}}$.

for all j ∈ mycols(P) such that nmod(j) = 0 do
    compute d_j (Eq. 3 and 4)
    compute L_{*j} (Eq. 6)
    ncol(P) = ncol(P) - 1
    send d_j and L_{*j} to the processors responsible for the
        columns with indices in users(j)
endfor

while ncol(P) > 0 do
    wait for reception of a column j
    for s ∈ mycols(P) ∩ users(j)
        update column s (see below, Eq. 8)
        nmod(s) = nmod(s) - 1
        if nmod(s) = 0 then
            compute d_s (Eq. 3 and 4)
            compute L_{*s} (Eq. 6)
            ncol(P) = ncol(P) - 1
            send d_s and L_{*s} to the processors responsible for
                the columns with indices in users(s)
        endif
    endfor
endwhile


Fig. 1. The fan-out method (FO) for processor P


Vectors $\langle js \rangle$ for which s ∉ mycols(P) are referred to as non-local products.
Self-messaging has also been suppressed, and in consequence a certain amount
of internal traffic control is necessary: if the updating of column s by column
j ∈ mycols(P(s)) completes the updating of column s (so making it (almost)
ready to be used to update other columns) before column j has finished updating
all columns s* ∈ mycols(P(j)) ∩ users(j), then column s is added to a queue of
columns waiting to be used for updating.

4  Fan-in  Methods 

The main weakness of the fan-out algorithm is the large interprocessor
communications volume it involves. The number of interprocessor communications can
be reduced if the contributions to column s by all columns j belonging to a
single processor P(j) are summed before being sent from P(j) to P(s); this is the
idea of the fan-in strategy. The algorithm, in a simplified way, is shown in Fig. 3,
where suppliers(s) is the set of column indices j such that s ∈ users(j), u(P, s) is
the vector accumulating updates to column s involving columns j ∈ mycols(P),
and pmods(s) is the number of processors P providing updates to column s.
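
The difference in message counts between the two strategies can be illustrated with a small counting sketch. It assumes one message per (column, destination processor) pair in fan-out and one aggregated message per (source processor, target column) pair in fan-in; the function name and the `users`/`owner` dictionaries are hypothetical stand-ins for users(j) and the column-to-processor mapping.

```python
def count_messages(users, owner):
    """Return (fan_out, fan_in) interprocessor message counts.

    users[j] lists the columns s that need column j;
    owner[c] is the processor responsible for column c.
    """
    fan_out = 0
    fan_in_pairs = set()
    for j, cols in users.items():
        # Fan-out: column j is sent once to every other processor owning a user.
        fan_out += len({owner[s] for s in cols} - {owner[j]})
        # Fan-in: owner[j] sends one aggregated vector u(P, s) per remote column s.
        for s in cols:
            if owner[s] != owner[j]:
                fan_in_pairs.add((owner[j], s))
    return fan_out, len(fan_in_pairs)
```

With three columns on one processor all updating a fourth column on another processor, fan-out needs three messages while fan-in needs one. The saving depends on the sparsity pattern and the mapping, so the sketch is useful mainly for comparing the two counts on a given pattern.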

In FI, columns are computed in order, with the result that high-index columns
are not updated at all until all lower-index columns have been computed. To




VECPAR  ’2000  -  4th  International  Meeting  on  Vector  and  Parallel  Processing 


for all j ∈ mycols(P) such that nmod(j) = 0 do
    compute d_j (Eq. 3 and 4)
    compute L_{*j} (Eq. 6)
    ncol(P) = ncol(P) - 1
    compute and send non-local products
    for s ∈ mycols(P) ∩ users(j) do
        compute <js> and update column s
        nmod(s) = nmod(s) - 1
    endfor
endfor

while ncol(P) > 0 do
    wait for reception of an update to some column j ∈ mycols(P)
    update column j
    nmod(j) = nmod(j) - 1
    if nmod(j) = 0 then
        compute d_j
        compute L_{*j}
        ncol(P) = ncol(P) - 1
        compute and send non-local products
        for s ∈ mycols(P) ∩ users(j)
            compute <js> and update column s
            nmod(s) = nmod(s) - 1
            if nmod(s) = 0 then
                add column s to the queue
            endif
        endfor
    endif
    while queue not empty do
        get next column from queue (column j, say)
        compute d_j
        compute L_{*j}
        ncol(P) = ncol(P) - 1
        compute and send non-local products
        for s ∈ mycols(P) ∩ users(j)
            compute <js> and update column s
            nmod(s) = nmod(s) - 1
            if nmod(s) = 0 then
                add column s to the queue
            endif
        endfor
    endwhile
endwhile

Fig. 2. The fan-out method with pre-multiplication (PMFO) for processor P

for s = 1 to n do
    if s ∈ mycols(P) or ∃ j ∈ mycols(P) ∩ suppliers(s) then
        u(P, s) = 0
        for j ∈ mycols(P) ∩ suppliers(s) do
            u(P, s) = u(P, s) + l_sj d_j (l_sj, ..., l_nj)^T
        endfor
        if s ∈ mycols(P) then
            (l_ss, ..., l_ns)^T = (a_ss, ..., a_ns)^T - u(P, s)
            while pmods(s) ≠ 0 do
                wait for reception of a vector u(P*, s) from some other processor P*
                (l_ss, ..., l_ns)^T = (l_ss, ..., l_ns)^T - u(P*, s)
                pmods(s) = pmods(s) - 1
            endwhile
            compute d_s
            L_{*s} = L_{*s} / d_s
        else
            send u(P, s) to P(s)
        endif
    endif
endfor


Fig. 3. The fan-in method (FI) for processor P


remedy this, the data-driven fan-in method (DDFI) is proposed (Fig. 4). This
algorithm combines the low message count of the fan-in method with the
data-driven character of the fan-out method. This is achieved by updating
columns at the earliest possible moment. The variable nlmod(s) is initialized
as |suppliers(s) ∩ mycols(P)|. DDFI has the same number and volume of
interprocessor communications as FI, but the overlap of communications and
computations reduces idle times.


5  Experimental  Results 


The algorithms described above have been implemented on two distributed
memory parallel computers, the Fujitsu AP1000 [8] and the Cray T3E [14], some
characteristics of which are listed in Table 1. The AP1000 was programmed
using its native message-passing routines, and the T3E using the standard MPI
library. Double precision floating point arithmetic was used throughout.

while ncol(P) ≠ 0 do
    while queue not empty do
        get next column from queue (column j, say)
        compute d_j
        compute L_{*j}
        ncol(P) = ncol(P) - 1
        for s ∈ users(j) do
            nlmod(s) = nlmod(s) - 1
            if nlmod(s) = 0 then
                u(P, s) = 0
                for l ∈ mycols(P) ∩ suppliers(s) do
                    u(P, s) = u(P, s) + l_sl d_l (l_sl, ..., l_nl)^T
                endfor
                if s ∈ mycols(P) then
                    (l_ss, ..., l_ns)^T = (a_ss, ..., a_ns)^T - u(P, s)
                    pmods(s) = pmods(s) - 1
                    if pmods(s) = 0 then
                        add column s to the queue
                    endif
                else
                    send u(P, s) to P(s)
                endif
            endif
        endfor
    endwhile
    wait for reception of a vector u(P*, j) (j ∈ mycols(P))
        from some other processor P*
    (l_jj, ..., l_nj)^T = (l_jj, ..., l_nj)^T - u(P*, j)
    pmods(j) = pmods(j) - 1
    if pmods(j) = 0 then
        add column j to the queue
    endif
endwhile


Fig. 4. The data-driven fan-in method (DDFI) for processor P


For the purpose of evaluating the algorithms, the salient difference between
the two computers concerns the ratio between processor speed and interprocessor
communications speed: 1.25 FLOPs/byte for the T3E as against only 0.22
FLOPs/byte for double-precision calculations on the AP1000. This difference in
the relative capacities of the processing and communications systems means that
the attractiveness of the fan-in algorithm relative to the fan-out algorithm is in
principle greater for the T3E than for the AP1000.
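
The two ratios quoted above appear to follow from the figures in Table 1 if the peak MFLOPs rating is divided by the torus network bandwidth in MB/s. The sketch below reproduces that arithmetic; the pairing of values (600 MFLOPs with 480 MB/s for the T3E, 5.6 double-precision MFLOPs with 25 MB/s for the AP1000) is an assumption inferred from the numbers, and the function name is illustrative.

```python
def flops_per_byte(mflops, bandwidth_mb_per_s):
    """Ratio of processor speed to interprocessor communication speed."""
    return mflops / bandwidth_mb_per_s

t3e_ratio = flops_per_byte(600.0, 480.0)     # Cray T3E, 3D torus
ap1000_ratio = flops_per_byte(5.6, 25.0)     # Fujitsu AP1000, 2D torus, double precision
```

A higher ratio means that communication is relatively more expensive than computation, which is why algorithms with fewer messages are expected to be more attractive on the T3E.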

The performance of the algorithms was evaluated using five benchmark matrices:
three belonging to the Harwell-Boeing collection [2] (BCSSTM07, ERIS1176
and ZENIOS) and two randomly generated matrices (RANDOM and RANDOM1).
Their characteristics are listed in Table 2, where nz is the number of
non-zero entries.

Table 1. AP1000 and T3E characteristics

                          AP1000                    T3E
Number of processors      4 to 1024                 16 to 2048
Interprocessor networks   Broadcast (50 MB/s)       3D torus (480 MB/s)
                          2D torus (25 MB/s)
                          Synchronization
Processor                 SPARC                     DEC Alpha 21164
Cache memory              128 KB                    1st level: 8 KB inst./data
                                                    2nd level: 96 KB
Local memory              16 MB                     64 MB to 2 GB
MFLOPs                    8.3 (single precision)    600
                          5.6 (double precision)

Table 2. Benchmark matrices

MATRIX      n      nz in A   nz in L   FLOPs
BCSSTM07    420    3836      14282     579984
ERIS1176    1176   9864      49639     3151680
ZENIOS      2873   15032     62105     4865300
RANDOM      1250   1153      32784     3698402
RANDOM1     2000   1475      106546    23703166

Figure 5 plots, as functions of the number of processors, the number of
interprocessor communications involved in the factorization of the matrices by each
algorithm.

Figures 6 and 7 show, for the AP1000 and T3E respectively, the total idle time
consumed by the last processor in terminating its task during the factorization
of the matrices. The idle time for PMFO remains constant or decreases for N
(the number of processors) greater than about 4; when N is large enough the idle
time always seems to be smaller than for FO, as was expected. Similarly, DDFI
generally has a slightly smaller idle time than FI (the major exception concerns
the factorization of ZENIOS on the AP1000).

Figures 8 and 9 show, for the AP1000 and T3E respectively, the speed-up
achieved for the benchmark matrices by each algorithm and for different numbers
of processors. For N greater than a given threshold, PMFO always has better
speed-up than FO on the AP1000. On the T3E its extra communications burden
outweighs the reduction in processor idle time, at least for N < 16; in fact,
PMFO generally has the best speed-up of all the algorithms on the AP1000 and
the worst of all on the T3E. With regard to the fan-in algorithms, the DDFI
algorithm improves slightly on FI on both computers.

[Plots omitted: number of messages versus number of processors for FO, PMFO, and FI/DDFI]

Fig. 5. Number of messages required by the various algorithms

[Plots omitted: idle time versus number of processors for FO, PMFO, FI, and DDFI]

Fig. 6. Idle times, in seconds, on the Fujitsu AP1000

[Plots omitted: speed-up versus number of processors]

Fig. 9. Speed up on the Cray T3E


6  Conclusions 

In this work the fan-in and fan-out algorithms have been adapted for modified
Cholesky factorization and improved to reduce idle times. The modified versions
generally obtain better results than the unmodified algorithms, except for the
case of the modified fan-out algorithm on the T3E. Due to its extra
communications burden, this algorithm performs worse than the unmodified algorithm in
communications-intensive situations.

The behavior of these algorithms depends on the features of the systems used
to execute the codes. The main benefit of the PMFO algorithm is the reduction of
idle times at the expense of an increase in the number of communications. That
is why the modification proposed for the FO algorithm (PMFO) is appropriate in
systems where the difference between computation and communication speed is
not very high. In those systems in which communications are a critical factor
it is preferable to use the algorithms that generate a lower number
of communications, that is, the fan-in algorithms. Among the fan-in algorithms,
the DDFI algorithm is the one that offers the best performance.

Bearing in mind that the number of floating point operations involved in
these calculations is relatively small, the speed-up achieved by these algorithms
is quite considerable, even on the T3E.

Abstract. It has been known for some time that groups as index
domains of indexable container types provide a unified view for "geometric"
(grids) and "hierarchic" (trees) spatial structures. This conceptual
unification is the starting point of further generalizations.

In this paper we present a new kind of index domains that combine
both kinds of structure in a single index domain. Together with the
"structured-universe approach", these new index domains constitute a
framework for an expressive description of adaptive multi-grid
discretizations and algorithms.

Keywords:  Programming  models,  data  parallelism,  container  types, 
structured-universe  approach,  multi-grid,  indexable  types,  groups. 


1  Introduction:  Infinite  Index  Domains  and  the 
“Structured-Universe  Approach” 

As is well known, virtual memory allows for a dynamic extensibility of data
structures like stacks and heaps under preservation of their logical contiguity in the
address space. The memory-management unit (MMU) inserts an abstraction
layer which maps a finite number of finite substructures (the "pages") of a
conceptually infinite address domain ($\mathbb{N}_0$) onto some physical representation.

The structured-universe approach is a high-level container type concept with
a similar kind of abstraction as virtual memory [12]. Its data types, called "power
types", are indexable types with infinite index domains and a distinguished
default "zero value" for the element type (0.0 for REAL, etc.).

By appropriate operands and data parallel operations, arbitrary elements of
power-type variables can be overwritten, finitely many at a time. Thus,
power-type variables always have finitely many non-zero elements (shown in black in
Fig. 1); this property is somewhat reminiscent of infinite-dimensional vector
spaces. The state-changing operations can alter finite substructures indexed by
chunks of any shape and size and at any location in the index domain (in contrast
to the allocation of fixed pages). This allows for a convenient modeling of
dynamic and irregular data structures under preservation of their logical contiguity
and neighbourhood structure in their global problem-specific index domain, and
leads to compact programs that are close to the problem's underlying
mathematical formulae.

* This work was supported by the Real World Computing Partnership (RWCP),
Japan.

[Figure omitted: element type A with zero value "0"]

Fig. 1. The "structured-universe approach": Indexable types with infinite index
domains and a default "zero" value for the element type. For variables, the supports are
restricted to be finite
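
The idea of a power type, an indexable container over an infinite index domain in which only finitely many elements differ from a default zero value, can be sketched in a few lines. The class below is a hypothetical illustration in Python, not the programming model of the paper: the name `PowerType` and its methods are assumptions, and the index domain is taken to be tuples of integers.

```python
class PowerType:
    """Container over an infinite index domain with a default zero value."""

    def __init__(self, zero=0.0):
        self.zero = zero
        self._data = {}                 # only non-zero elements are stored

    def __getitem__(self, index):
        # Every index of the infinite domain is readable.
        return self._data.get(index, self.zero)

    def __setitem__(self, index, value):
        if value == self.zero:
            self._data.pop(index, None)  # writing zero shrinks the support
        else:
            self._data[index] = value

    def support(self):
        """The finite set of indices holding non-zero elements."""
        return set(self._data)
```

Reading at any index, however far from the stored data, returns the zero value, so the logical index domain stays unbounded while storage is proportional to the support.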

Both burden and freedom of setting up the internal technical representation
for this "shape-and-granularity polymorphism" are then transferred to the
underlying abstract machine, which has to act as something like an "Index
Domain Management Unit" (in analogy to the MMU). The structural information
necessary to do so efficiently on a distributed-memory machine (especially
locality information) is contained, partially statically and partially dynamically
depending on the nature of the application, in the index domains, the data and
communication patterns, and the operations with them.

An approach that preserves the problem-specific structure of index domains is
only as valuable as the structure those domains actually contain.
Therefore the structured-universe approach is equipped with a
variety of problem-specific index domains, which are infinite and, in other ways
to be seen later, more general than usual. A non-obvious example of these
index domains is the topic of this paper.

Overview: In Sects. 2 and 3, we analyze the formal properties of index
domains in general and for multi-grid data in particular. In Sects. 4 through 6, we
sketch a small sample problem, an algorithm, and program text, and summarize
the relations between the respective abstract properties of the application
and the programming model employed. In Sects. 7 and 8, we make comparisons,
summarize, and draw conclusions.

2  What  Accounts  for  the  “Right”  Index  Domain, 
and  Why? 


The index domains effect a problem-specific geometrization of container data.
As for the "right" index domains, we intuitively feel, for instance, that a
two-dimensional grid should be modeled by a two-dimensional array, and that its
mapping onto a one-dimensional address space should be done by the compiler.
Analogous considerations hold for higher dimensions and, as we shall see, can also
be applied to structures that are usually not perceived as indexable ones, such as
trees. A formalization of this intuition leads to the following criteria of naturalness
of index domains:

1. Nearest-neighbour relations must correspond to index-arithmetically small
distances. Simultaneous nearest-neighbour communications must be "parallel
shifts" of data within an index domain.

2. Multiple non-elemental substructures of a power-type entity must be indexable
meaningfully by multiple congruent subsets of the index domain (e.g.,
the rows in a matrix).

(Multiple non-elemental substructures occur for instance in routine liftings
that express nested parallelism. Multiple substructures of congruent shapes
correspond to what other container-type concepts express by multiple
substructures of the same type [10].)

If the structured-universe approach is used with the right index domains,
irregularity and dynamicity of spatial structures typically go into the supports of the
data (shown in black in Fig. 1), while the communication patterns and data
decomposition schemes retain their regularity in the infinite index domains. Pointers
and indirect indexing, which are the structureless "spaghetti" implementation
techniques in this field, need to be employed less frequently.

Groups  as  index  domains.  It  has  been  known  for  some  time  that  finitely 
generated  groups  constitute  a  unified  index  domain  concept  for  grids  and  trees 
in  the  sense  explained  above  [5, 10].  The  following  correspondences  hold  between 
spatial  structures  and  the  (infinite)  groups  into  which  they  are  embedded  as 
substructures: 

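\[
\begin{array}{rcl}
\text{grids} & \subset & \text{free Abelian groups} \\
 & \vdots & \qquad \text{(degree of commutativity)} \\
\text{trees} & \subset & \text{free groups}
\end{array}
\tag{1}
\]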

Now with groups as index domains, the parlance changes a bit:

1. The role of describing “small distances”, formerly played by (tuples of) small
integers, is now played by (sums of few of) the generators of the group.

2. “Parallel shifts” within an index domain, and congruence of subsets, are
defined in terms of the respective group operation, here generically written ⊕.

For integer grids, these terms still coincide with the intuitive understanding.
Analogously for trees, the neighbourhoods characterized by the generators of
the group are those between parent and child, and communications between
parents and their respective children are described by parallel shifts of data within
the index domain by a small index-arithmetic distance. The non-commutativity
of the corresponding groups reflects the special geometry of trees, which after
all is different from that of grids.
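
The contrast between the two extreme cases can be made concrete with a small sketch. It is only an illustration under stated assumptions: grid indices are modelled as pairs of integers (the free Abelian group on two generators), tree indices as reduced words over two hypothetical generators named "l" and "r" (a free group), and the function names are not taken from Universe.

```python
def compose_grid(a, b):
    """Free Abelian group Z x Z: component-wise addition of index pairs."""
    return (a[0] + b[0], a[1] + b[1])


def compose_tree(a, b):
    """Free group: append the letters of b to a, cancelling adjacent inverse pairs.

    A word is a tuple of (generator, exponent) letters with exponent +1 or -1.
    Both arguments are assumed to be reduced already.
    """
    word = list(a)
    for gen, exp in b:
        if word and word[-1] == (gen, -exp):
            word.pop()          # a generator meets its own inverse
        else:
            word.append((gen, exp))
    return tuple(word)


def inverse_tree(a):
    """Inverse of a word: reverse the letters and negate the exponents."""
    return tuple((gen, -exp) for gen, exp in reversed(a))


EAST, NORTH = (1, 0), (0, 1)                 # grid generators
LEFT, RIGHT = (("l", 1),), (("r", 1),)       # tree generators (hypothetical names)
ROOT = ()

# Going from a node to its child is a shift by one generator:
left_child_of_right_child = compose_tree(compose_tree(ROOT, RIGHT), LEFT)
```

In the grid the order of the two steps does not matter, whereas in the tree "left then right" and "right then left" name different nodes; the shift by the inverse generator leads back to the parent.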

In short, groups as index domains constitute a unification of container
structures that are commonly regarded as quite different. And even better, this
unification is the starting point for a further generalization, which we now begin
to introduce.

[Figure caption, beginning lost in extraction: “… Observe the close interaction of the grid-like and the tree-like spatial structures, as formalized by relation (2).”]


Degree  of  commutativity.  We  begin  with  a  remark  about  commutativity  of 
groups.  There  are  several  ways  to  attribute  a  gradated  “degree  of  commutativity” 
to  groups,  as  opposed  to  a  mere  Abelian-or-not  classification.  In  all  of  these  ways, 
Abelian  groups  and  free  groups  mark  the  opposite  extreme  cases.  So  it  appears 
to  be  natural  to  investigate  whether  “intermediate”  groups  between  the  extremes 
serve  some  purpose.  This  is  indeed  the  case,  and  one  of  the  possibilities  to  fill  in 
the  ellipsis  in  (1)  is  the  kind  of  groups  we  are  going  to  present  in  the  next  section. 
It  is  not  much  of  a  surprise  that  this  kind  of  groups  exhibits  an  amalgamation 
of  both  grid-like  and  tree-like  spatial  structures  in  the  same  index  domain. 

3  The  Index  Domain  for  Multi-grid  Data 

Multi-level methods (methods that employ multi-level discretizations) occur in
various fields. They are renowned for their efficiency and, in the case of the
dynamically adaptive variant on distributed-memory machines, notorious for their
difficulty of programming. They are treated in more depth e.g. in [2, 6]; here we
just mention that their characteristic property is the combined use of
discretizations of the same physical space at different levels of resolution. The algorithms
typically employ both intra-level and inter-level communications. Here we
confine ourselves to geometric multi-grid methods.

Our starting point is the observation that the spatial resolution of the
discretization (usually) doubles in the transition from one level to the next one. For
an illustration we assume a two-dimensional integer grid (index domain ℤ²) and
use the term “one level down” for the transition to the next level with doubled
resolution. Then, in order to cover a certain distance x at one level farther down,
we have to go twice as many steps. (E.g., first going east one step and then going
down is the same as going down first and then going east two steps; see Fig. 2).
This observation can very well be formalized as a relation within a non-Abelian
group:

\[
x \oplus \mathit{down} = \mathit{down} \oplus x \oplus x \quad \text{for all } x \in \mathbb{Z}^2 . \tag{2}
\]

So we construct the index domain for a multi-level discretization of a
two-dimensional domain as follows: The group ℤ² is extended by an additional
generator, called “down”, and made subject to the relation (2). Figure 2 shows
a section of the resulting index domain, which clearly exhibits the desired
multi-level nature.
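
One way to experiment with this group is sketched below. The encoding is an assumption made for illustration, not the representation used by Universe: every element is stored in the normal form level*down ⊕ (x, y) as a triple (level, x, y), and the coordinates are allowed to be dyadic fractions so that inverses of down can always be moved to the front. The names `compose` and `inverse` are hypothetical.

```python
from fractions import Fraction


def compose(a, b):
    """Return a ⊕ b for elements stored as (level, x, y).

    Moving the plane part of a past the level part of b scales it by
    2**level_b, which is exactly what relation (2) demands for level_b = 1.
    """
    level_a, xa, ya = a
    level_b, xb, yb = b
    scale = Fraction(2) ** level_b
    return (level_a + level_b, scale * xa + xb, scale * ya + yb)


def inverse(a):
    """Return the element b with a ⊕ b equal to the identity."""
    level, x, y = a
    scale = Fraction(2) ** (-level)
    return (-level, -scale * x, -scale * y)


IDENTITY = (0, 0, 0)
DOWN = (1, 0, 0)
EAST = (0, 1, 0)
NORTH = (0, 0, 1)

# "east, then down" equals "down, then east twice" (relation (2)):
east_then_down = compose(EAST, DOWN)
down_then_two_east = compose(compose(DOWN, EAST), EAST)
```

The plane generators still commute with each other, while a plane generator and down do not; this is the intermediate degree of commutativity referred to in Section 2.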

We compare the above-mentioned communication relations in multi-level
methods with the geometry represented by this group (see footnote 1): Intra-level
nearest-neighbour communications (e.g., in the computation of point-wise residuals)
work just as in integer grids. Inter-level communications (e.g., in the
computation of prolongation and restriction operators) can be expressed in the same
way by data shifts by small distances, using down or its inverse, respectively.

In summary, the presented index domain is capable of formalizing both kinds
of locality of originally different nature. Hence, both kinds of (translation-invariant)
communication can be expressed as convolutions by appropriate stencils.
The only difference is that the convolution takes place in the new kind of index
domain and is defined by means of the group operation “⊕”.

Groups  that  model  the  geometry  of  anisotropic  (nonstandard)  coarsenings 
can  be  constructed  similarly,  but  this  is  not  carried  out  here. 


4  A  Sample  Problem  and  its  Numerical  Method 

4.1  The  Problem 

The motivations for multi-level approaches are (i) faster convergence, and (ii) adaptive
refinements, for a reconciliation of computational effort and accuracy.

As an example for both the structured-universe approach and the new kind of
index domains, we present an adaptive multi-grid application. We consider a
simple boundary-value problem. We assume as given

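a domain \(\Omega = (a, b) \times (a, b) \subset \mathbb{R}^2\),
a function \(f : \Omega \to \mathbb{R}\),
a boundary function \(F : \partial\Omega \to \mathbb{R}\),

and seek as solution

\(u : \overline{\Omega} \to \mathbb{R}\)

\[
\text{with}\quad Lu = -\Delta u = f \quad \text{on } \Omega \tag{3}
\]

and \(u|_{\partial\Omega} = F\).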

Footnote 1: Recall that the purpose of the index domains in the structured-universe approach
is to express the “natural” problem-specific neighbourhoods and congruences within
container data, as explained in Section 2.
We assume that the right-hand side f possesses a singularity somewhere on the
boundary ∂Ω, so that the problem calls for adaptive refinement.

4.2  The  Numerical  Method 

Data fields and basic operations. For an initial coarse level 0 and for a
finite number of successively finer levels, the following infinite-grid quantities
with finite supports are maintained: “interpolated solution”, “solution corrector”,
“residual”, and “right-hand-side perturbation”; these names may appear
abbreviated in equations and program text. Figure 3 sketches the data structure
and the data flows therein.

The  residual  follows  the  other  quantities  so  that  the  following  variant  of  (3)  is 
fulfilled  (in  its  respective  discretized  form): 

\[
L(\text{interpolated\_solution} + \text{solution\_corrector}) = f + \text{RHS\_perturbation} + \text{residual} \tag{4}
\]

The  solution  algorithm  will  be  constructed  from  the  following  four  basic 
operations  (larger  level  numbers  correspond  to  finer  resolutions): 

1. Initialization at level 0: At the coarsest level, the (small) system of equations
is solved, and the solution is stored into the field interpolated_solution.

2. Interpolation from level k to k+1: A suitable interpolation operator is applied
to the sum interpolated_solution + solution_corrector of level k, and the result
is stored into the field interpolated_solution of level k+1.

3. Smoothing at a level k: A smoothing method is applied to the residual at
level k, and the resulting correction values are added to the already existing
solution_corrector. (The residual decreases accordingly.)

4. Restriction (residual coarsening) from level k+1 to k: A restriction operator
is applied to the residual at level k+1, and the result is stored into the field
RHS_perturbation of level k.


Organization  of  the  basic  operations.  In  the  multi-grid  terminology,  the 
method presented here is a full multi-grid (FMG) scheme with V(1,1)-cycles. It
is  organized  as  follows:  After  the  initialization  at  level  0,  the  process  descends 
(i.e.,  interpolation  followed  by  smoothing)  to  a  certain  maximal  depth  and  then 
ascends  (i.e.,  restriction  followed  by  smoothing)  back  to  level  0.  These  descents 
and  ascents  are  continued  with  a  successively  increasing  maximal  depth  until  no 
further  refinement  is  necessary. 

For simplicity of presentation, the algorithm presented here deviates somewhat
from the conventional ones by the following modifications: First, we neglect the
fact that usually different interpolation operators are employed in the FMG
refinements and the multi-grid cycles. Second, the coarse-grid corrections are
calculated for the perturbed original equation (this is the full approximation
scheme used for non-linear equations), and not from the pure defect equation.
Third, the coarse-grid corrections of refined subgrids nevertheless take place in
the larger subregions pertaining to the coarser grids. This appears to be more
intuitive, as even a residual with a limited support may very well lead to a global
correction of the solution.

[Figure 3: for each of the levels 0, 1, and 2, the fields “interpolated solution”, “soln. corrector”, “RHS perturbation”, and “residual”, connected by the data flows “coarse-grid initial solution”, “interpolation”, “smoothing”, and “restriction”.]

Fig. 3. Data fields and dependences for the modified full multi-grid (FMG) scheme.
Larger level numbers refer to finer grids. With spatial adaptivity, some finer levels may
represent only subsets of the problem domain.

Spatial adaptivity consists in the technique that increasingly finer resolutions
(with larger computational effort) are applied only to increasingly smaller
subregions of the problem domain, under control of some refinement criterion (a local
discretization-error estimator). These subregions turn out to be the
neighbourhoods of the singularity of the right-hand side f of (3). (We assume that f
possesses only one singularity, so that the latter can be enclosed at each level by
a single rectangular subdomain.)

FMG, if used with a sufficiently good interpolation operator, has the property
that for each level of refinement, an accuracy up to the corresponding
discretization error of that level is achieved already after a single multi-grid cycle. We
exploit this property and consider convergence to have occurred as soon as no
further local refinement is required.


5  The  Program 


5.1  Prerequisites  and  Basic  Program  Patterns 

The program will be presented in an experimental linguistic concretization of
Universe [11] on top of Oberon-2 [9]. Keywords and the predefined
identifiers of the host language are in all-caps. For space considerations, only very
brief explanations are given here.

An operation pattern that will occur frequently in the program text is the
following one:

power-type-variable[subdomain] := power-type-value $$ power-type-value ,   (5)

typically such that one operand of the infix operator “$$” (convolution) identifies
a static communication pattern (a stencil).

By such statements, some elements of a power-type variable are overwritten,
viz. those elements that are indexed by the subdomain in index position, the
“selection mask”. In a writing access like here, masking of a power-type variable
means that only the selected elements are overwritten; masking of a power-type
value means that the non-selected elements are replaced by zero in the result.

The subdomain expressions in the program text to follow might appear quite
complicated at first sight, but they resemble the conventional mathematical
notations of intervals, element-wise sums of sets, Cartesian products, etc.

The infix expression on the right-hand side of the assignment is a (discrete)
convolution product (a shortcut for “$* REDUCE BY + $” if the element types of
the operands are numbers). It yields a result with the same index domain as
the operands, and for all non-zero elements x_i of the first operand and y_j of the
second operand, the products x_i * y_j are accumulated into the element z_{i⊕j} of
the result z.
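
The semantics of pattern (5) can be imitated in a few lines. The sketch below is an assumption-laden model rather than the Universe implementation: power-type values are dictionaries that store only their non-zero elements, the index domain is the plane grid with component-wise addition as group operation, and the names `convolve` and `assign_masked` are hypothetical.

```python
from collections import defaultdict


def add_index(i, j):
    """Group operation of the plane grid Z x Z (component-wise addition)."""
    return (i[0] + j[0], i[1] + j[1])


def convolve(x, y, op=add_index):
    """Accumulate x[i] * y[j] into z[op(i, j)] for all stored non-zero elements."""
    z = defaultdict(float)
    for i, xi in x.items():
        for j, yj in y.items():
            if xi != 0 and yj != 0:
                z[op(i, j)] += xi * yj
    return dict(z)


def assign_masked(target, mask, value):
    """Overwrite only those elements of target that the mask selects."""
    for index in mask:
        target[index] = value.get(index, 0.0)


# The five-point stencil used later in the paper, as a sparse pattern.
L_STENCIL = {(0, 0): 4.0, (0, 1): -1.0, (1, 0): -1.0, (0, -1): -1.0, (-1, 0): -1.0}

unit_pulse = {(0, 0): 1.0}
result = {(5, 5): 7.0}                      # an element outside the mask
assign_masked(result, {(0, 0), (1, 0)}, convolve(unit_pulse, L_STENCIL))
```

Convolving a unit pulse with the stencil reproduces the stencil, and the masked assignment leaves the element at (5, 5) untouched. Passing a different `op` would move the same convolution to another group-structured index domain.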

Convolutions are the method of choice for the (non-redundant) expression
of translation-invariant communications (data movements) within an index
domain. In all usages here, one of the operands is a static pattern (a stencil), which
represents the discretization of the underlying linear operator.

A  sensible  implementation  will  compute  only  those  elements  of  the  right- 
hand  side  that  are  actually  used  (e.g.,  not  masked  out).  In  order  to  facilitate 
this,  power-type  products  (e.g.,  the  convolution)  and  also  implicit  liftings  of 
scalar  pure  functions  and  operators  have  lazy  semantics  in  Universe. 

5.2  Global  Declarations 

First, two index domains are declared, viz. the two-dimensional infinite integer
grid and the index domain of Sect. 3.

The INDEXCOUNTER declaration declares two symbolic power-type constants
Xcoord and Ycoord with index domain ℤ × ℤ and element type ℤ. These
constants provide the “canonic” x and y coordinates of the integer grid; after a
multiplication by the appropriate mesh size they will be used to parametrize the
parallel invocations of the right-hand side f and of the boundary condition F.
For every index point that can be written as sum of (1,0) and (0,1) and their
inverses, the respective associated symbolic counter indicates how many of the
generators are used to express that index point.

The variable values holds the data structure depicted in Fig. 3; for each level,
regions[level] holds the integral corner coordinates of the finite rectangular
subgrids that correspond to Ω or its refining subregions, respectively.

INDEXDOMAIN
  PlaneGrid = ℤ × ℤ;
  MultiGrid = EXT(PlaneGrid, Down, 2);   (* see Sect. 3 *)

INDEXCOUNTER Xcoord OF (1,0), Ycoord OF (0,1);

TYPE
  Point: RECORD sol, corr, resid, perturb: REAL END;
  RectRegion: RECORD xa, xz, ya, yz, num: INTEGER END;

VAR
  values: [MultiGrid]<Point>;
  regions: [ℤ]<RectRegion>;


5.3  The  Basic  Operations 

The  solution  of  the  small  system  of  equations  at  the  coarsest  level  (Step  1  in 
Sect.  4.2)  is  often  done  with  a  direct  solver,  and  is  not  shown  here. 

Residual evaluation. The residual is evaluated according to (4). Besides
point-wise real additions and subtractions, the computation consists of the
evaluation of the discretized form of L and of the right-hand side f. The former is
done by convolution by the stencil Lstencil, which is a power-type constant
with index domain ℤ × ℤ, given below as a cascaded conditional expression.
The (scalar) function f is invoked multiple times (“lifted”) with an explicitly
specified replication space appearing before it, and each invocation accesses the
corresponding elements of the power-type arguments, which are obtained from
the symbolic integer coordinates Xcoord and Ycoord scaled by the mesh size h.

CONST Lstencil =                                 (*      -1      *)
  {(0,0)} => 4.0 :                               (*  -1   4  -1  *)
  {(0,1),(1,0),(0,-1),(-1,0)} => -1.0;           (*      -1      *)

PROCEDURE compResidual(level, xa, xz, ya, yz: INTEGER);
  VAR h: REAL;
BEGIN
  h := 1.0 / (2**level);
  (* evaluation of residual according to (4): *)
  values[{level*down}⊕{xa+1..xz-1}×{ya+1..yz-1}].resid :=
      (values.sol+values.corr) $$ Lstencil/(h*h)
    - values.perturb
    - {level*down} $$ [{xa+1..xz-1}×{ya+1..yz-1}].f(Xcoord*h,Ycoord*h)
END compResidual;
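
For readers unfamiliar with the notation, the following sketch spells out the same residual evaluation for a single level with explicit loops. It is a simplified model under stated assumptions: the fields are dictionaries keyed by integer grid coordinates, `u` stands for the sum of the fields sol and corr, and the function name `comp_residual` and its signature are hypothetical.

```python
def comp_residual(u, perturb, f, h, xa, xz, ya, yz):
    """Residual of (4) on the interior points of the rectangle [xa..xz] x [ya..yz].

    u        -- dict (x, y) -> value of sol + corr, boundary points included
    perturb  -- dict (x, y) -> right-hand-side perturbation (missing means 0)
    f        -- scalar right-hand side, called with physical coordinates
    h        -- mesh size of this level
    """
    resid = {}
    for x in range(xa + 1, xz):
        for y in range(ya + 1, yz):
            # five-point discretization of L = -Laplace: centre 4, neighbours -1
            lu = (4.0 * u[(x, y)]
                  - u[(x + 1, y)] - u[(x - 1, y)]
                  - u[(x, y + 1)] - u[(x, y - 1)]) / (h * h)
            resid[(x, y)] = lu - perturb.get((x, y), 0.0) - f(x * h, y * h)
    return resid


# Example: u(x, y) = x**2 satisfies -Laplace(u) = -2, so the residual vanishes.
h = 0.5
u = {(x, y): (x * h) ** 2 for x in range(5) for y in range(5)}
residual = comp_residual(u, {}, lambda px, py: -2.0, h, 0, 4, 0, 4)
```

The five-point stencil is exact for quadratic functions, which makes this a convenient consistency check.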


Smoothing. Smoothing is done by red-black relaxation, which combines good
smoothing properties with good parallelism properties [16]. The term refers to
the colouring of a grid in a chequerboard pattern: First, all “red” points are
relaxed, which can be done in parallel, and then all “black” points, again in
parallel (observe the two subdomains RedGrid and BlackGrid in the following
program fragment). As usual, “relaxing a grid point” refers to the point-wise
error smoothing by averaging that grid point with its neighbours, as determined
by the stencil involved, taking into account the right-hand side at the same
coordinates.




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


SUBDOMAIN
  RedGrid = { * OF (2,0), (1,1)};    (* spans all even-parity points *)
  BlackGrid = {(0,1)} ⊕ RedGrid;     (* coset of RedGrid *)

PROCEDURE smooth(level: INTEGER);
  VAR xa, xz, ya, yz: INTEGER; h: REAL;
BEGIN
  xa := regions[level].xa; xz := regions[level].xz;
  ya := regions[level].ya; yz := regions[level].yz;
  h := 1.0 / (2**level);
  compResidual(level, xa, xz, ya, yz);
  values[{level*down}⊕RedGrid].corr :=
      values.corr + values.resid*h*h/4.0;
  compResidual(level, xa, xz, ya, yz);
  values[{level*down}⊕BlackGrid].corr :=
      values.corr + values.resid*h*h/4.0
END smooth;
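
A loop-based sketch of one red-black sweep may help to see what the two masked assignments do. It rests on several assumptions: a single level, a dictionary `u` that already contains the boundary values, a combined right-hand side `rhs` (f plus perturbation), and the sign convention that follows from (4) as written, where the residual is L(u) minus the right-hand side, so that the local residual is cancelled by subtracting resid*h*h/4. The program fragment above may use a different sign convention. All names are hypothetical.

```python
def smooth_red_black(u, rhs, h, xa, xz, ya, yz):
    """One red-black Gauss-Seidel sweep for -Laplace(u) = rhs on interior points.

    Points with even coordinate sum ("red") are relaxed first, then the
    remaining ("black") points; within one colour the updates are independent.
    """
    for colour in (0, 1):
        for x in range(xa + 1, xz):
            for y in range(ya + 1, yz):
                if (x + y) % 2 != colour:
                    continue
                resid = (4.0 * u[(x, y)]
                         - u[(x + 1, y)] - u[(x - 1, y)]
                         - u[(x, y + 1)] - u[(x, y - 1)]) / (h * h) - rhs.get((x, y), 0.0)
                u[(x, y)] -= resid * h * h / 4.0   # makes the local residual zero


# Example: Laplace equation with boundary value 1; the single interior point
# of a 3 x 3 grid is replaced by the average of its four neighbours.
grid = {(x, y): 1.0 for x in range(3) for y in range(3)}
grid[(1, 1)] = 0.0
smooth_red_black(grid, {}, 0.5, 0, 2, 0, 2)
```

Because red points only have black neighbours and vice versa, each half-sweep can be executed as one parallel masked assignment, which is what the two subdomains RedGrid and BlackGrid express.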


Restriction.  A  customary  and  robust  method  for  restriction  (coarsening)  of 
residuals  is  “full  weighting”  [2, 6].  Every  point  of  the  coarser  grid  gets  assigned  a 
weighted  sum  of  several  nearby  points  of  the  finer  grid,  and  the  weights  are  set 
up  in  such  a  way  that  all  points  in  the  finer  grid — also  the  interleaving  ones — 
have  the  same  total  sum  of  weights,  i.e.,  the  same  “influence”  on  the  coarser 
grid. 

CONST Restrictor =
  {-down} => 4.0/16.0 :
  {(0,1),(1,0),(0,-1),(-1,0)}⊕{-down} => 2.0/16.0 :
  {(1,1),(-1,1),(1,-1),(-1,-1)}⊕{-down} => 1.0/16.0;

PROCEDURE restrictResid(level: INTEGER);
BEGIN
  values[{level*down}⊕PlaneGrid].perturb :=
      values.resid $$ Restrictor;
END restrictResid;
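
The full-weighting stencil can be checked with a small stand-alone sketch. It assumes that the coarse point (X, Y) coincides with the fine point (2X, 2Y), in line with relation (2), and it models the two levels as separate dictionaries instead of one field over the multi-grid index domain; the names are hypothetical.

```python
# Full-weighting weights, keyed by the offset from the coinciding fine point.
WEIGHTS = {(0, 0): 4.0 / 16.0}
WEIGHTS.update({off: 2.0 / 16.0 for off in [(0, 1), (1, 0), (0, -1), (-1, 0)]})
WEIGHTS.update({off: 1.0 / 16.0 for off in [(1, 1), (-1, 1), (1, -1), (-1, -1)]})


def restrict_full_weighting(fine, coarse_points):
    """Weighted sum of the nine fine-grid values around each coarse point."""
    coarse = {}
    for (cx, cy) in coarse_points:
        coarse[(cx, cy)] = sum(
            w * fine.get((2 * cx + dx, 2 * cy + dy), 0.0)
            for (dx, dy), w in WEIGHTS.items()
        )
    return coarse


def total_influence(fine_point):
    """Sum of the weights with which one fine point enters the coarse grid."""
    points = [(cx, cy) for cx in range(-2, 3) for cy in range(-2, 3)]
    return sum(restrict_full_weighting({fine_point: 1.0}, points).values())


constant_fine = {(x, y): 1.0 for x in range(-3, 4) for y in range(-3, 4)}
coarse = restrict_full_weighting(constant_fine, [(0, 0), (1, 0), (0, 1)])
```

A constant fine-grid field is reproduced on the coarse grid, and every fine point, whether it coincides with a coarse point or lies between coarse points, contributes with the same total weight of 1/4.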


The  remaining  steps.  The  remaining  steps  are  explained  only  in  passing. 

The interpolation is in principle a linear operator just like the restriction,
expressed by convolution by a stencil. However, two details have to be taken
into account: (i) on the boundary ∂Ω, the solution candidate should be computed
directly from the given boundary function F, and not by interpolation; (ii) cubic
interpolation, which is advisable in FMG for numerical reasons, requires “four
points in a row”, but near boundaries and corners, these four points are not
available in a symmetric distribution, i.e., two at either side. Therefore, different
interpolation patterns have to be used near boundaries and corners. Both of
these case discriminations can be expressed by combining several assignments
like (5) with different subdomains.
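
The difference between the symmetric and the one-sided interpolation pattern can be made explicit by computing the one-dimensional cubic (Lagrange) weights for a fine point halfway between two coarse points. This is a generic textbook computation offered as an illustration; the paper does not list its interpolation stencils, and the function name is hypothetical.

```python
from fractions import Fraction


def lagrange_weights(nodes, t):
    """Weights w_i such that sum(w_i * u(nodes[i])) interpolates u at position t."""
    weights = []
    for i, xi in enumerate(nodes):
        w = Fraction(1)
        for j, xj in enumerate(nodes):
            if j != i:
                w *= Fraction(t - xj) / Fraction(xi - xj)
        weights.append(w)
    return weights


half = Fraction(1, 2)
# Interior: two coarse points on either side of the new fine point.
interior = lagrange_weights([-1, 0, 1, 2], half)
# Next to a boundary at 0: only one coarse point on the boundary side.
near_boundary = lagrange_weights([0, 1, 2, 3], half)
```

Here `interior` evaluates to (-1/16, 9/16, 9/16, -1/16) and `near_boundary` to (5/16, 15/16, -5/16, 1/16); both sets of weights sum to one, but they are different stencils, which is why separate masked assignments are needed near boundaries and corners.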
The procedure SetupRectangle(level) determines the corner coordinates
of the domain rectangle of that level and stores them into the fields of the variable
regions[level]. This is done either in accordance with Ω = (a, b)² at those levels
where the entire domain Ω is to be considered, or according to a local error
estimator at those levels where adaptive refinement is to be employed. The field
regions[level].num is set to 0 iff the refinement area is empty.

5.4  The  Main  Program 

The main program implements the algorithm sketched in Subsect. 4.2. The
algorithm begins at the coarsest level 0 and terminates at some fine level, viz. when
the refinement criterion states that no further refinement is necessary.

VAR level, depth, maxdepth, i: INTEGER;

SetupRectangle(0);
(* initial solution at level 0 (basic operation 1). *)

(* FMG multi-grid iteration: *)
maxdepth := 1;
LOOP
  (* descend down to level maxdepth: *)
  level := 0;
  REPEAT INC(level);
    interpolate(level); smooth(level)
  UNTIL level = maxdepth;
  interpolate(level+1);
  SetupRectangle(level+1);
  IF regions[level+1].num = 0 THEN EXIT (* from LOOP *) END;
  INC(maxdepth);  (* for the next round *)
  (* ascend back to level 0: *)
  REPEAT DEC(level);
    restrictResid(level); smooth(level)
  UNTIL level = 0;
  FOR i := 1 TO ... DO smooth(0) END;  (* few more smoothings at lev. 0 *)
END (* LOOP *);
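
To see the order in which the loop visits the levels, the control flow can be replayed without any numerics. The sketch below only records operation names; it assumes that the refinement criterion first reports an empty region at level `finest_level + 1`, which stands in for the test regions[level+1].num = 0, and the label "solve" for basic operation 1 as well as the function name are hypothetical.

```python
def fmg_schedule(finest_level, extra_smoothings=0):
    """Return the list of (operation, level) pairs issued by the main loop."""
    ops = [("solve", 0)]                     # initial solution at level 0
    maxdepth = 1
    while True:
        level = 0
        while True:                          # descend down to level maxdepth
            level += 1
            ops.append(("interpolate", level))
            ops.append(("smooth", level))
            if level == maxdepth:
                break
        ops.append(("interpolate", level + 1))
        if level + 1 > finest_level:         # refinement area empty: leave the loop
            break
        maxdepth += 1
        while True:                          # ascend back to level 0
            level -= 1
            ops.append(("restrict", level))
            ops.append(("smooth", level))
            if level == 0:
                break
        ops.extend([("smooth", 0)] * extra_smoothings)
    return ops


schedule = fmg_schedule(finest_level=2)
```

For `finest_level=2` the recorded schedule consists of one descent to level 1, one ascent to level 0, and a second descent to level 2, after which the loop terminates.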


6  Observations 

We  summarize  and  generalize  the  key  observations  about  the  relations  between 
numeric  applications  and  high-level  programming  models: 

-  Spatial  discretizations  with  arbitrary  refinements  are  modeled  naturally 
by  countably  infinite-dimensional  vector  spaces.  Problem-specific  operators 
(e.g.,  differential,  prolongation,  and  interpolation  operators)  often  are  linear 
operators  on  these  vector  spaces. 

A  programming  model  that  models  such  applications  in  terms  of  vector 
spaces  and  linear  operators  can  be  expected  to  lead  to  compact  programs. 
-  The phenomenon of irregularity and dynamicity of spatial structures is
banned from the semantics (there are no “irregular” vector spaces) and
delegated to the system.

-  If  the  canonic  bases  for  these  vector  spaces  are  chosen  adequately,  then 
the  problem-specific  linear  operators  correspond  to  “simple”  (e.g.,  nearest- 
neighbour  and/or  translation-invariant)  communication  patterns. 

A programming model that provides index domains that reflect the locality
properties of the application can be expected to lead to efficiency on
distributed-memory parallel machines.

-  In the case of geometric multi-grid discretizations, the interaction of grid-like
and tree-like geometries in the same index domain can be modeled by a
group with appropriate equality relations.

7  Comparisons 

Here  we  confine  ourselves  to  a  few  other  programming  models  that  are  related  to 
the  modeling  of  spatial  structure  of  parallel  applications.  For  a  broader  survey, 
see  for  instance  [15]. 


Other models with indexable types. A now “classic” programming model
that elaborates on indexable types is Crystal [3]. Crystal is a higher-order
functional language with data fields over generalized index domains, such as grids,
trees, and hypercubes, and data-field and index-domain morphisms. The semantic
complexity of Crystal is considerably higher than that of Universe.

Groups as index domains have also been proposed for the programming model
8½ [5]. 8½ does identify the correspondence between generators of groups and
basic neighbourhood structures (Cayley graphs), but does not further pursue the
issue of non-Abelian groups and the identification of useful ones, and proposes
their representation by libraries.


More  general  type  systems.  There  are  other  parallel  programming  models 
that  employ  inductive  types  or  even  more  general  settings  for  the  modeling  of 
spatial  structure.  As  examples  we  mention  the  Bird-Meertens  formalism  [13]  and 
NESL  [1]  for  (join-)  lists  and  Categorical  Data  Types  [14]  for  polymorphic  trees. 
A  typical  property  of  the  category-theoretic  approach  is  the  inference  of  the 
container  decompositions  from  the  type  constructors.  There  also  is  a  category- 
theoretic  understanding  of  shapes  [8]  (by  which  Universe  simply  understands 
patterns  in  structured  infinite  index  domains). 

An abstract generic concept of capturing parallelism is that of algorithmic
skeletons [15]. Programs are composed from as few as possible predefined
parametrizable building blocks (typically a small set of second-order functions),
aiming at implementing parallelism as composition of pre-implemented internally
parallel algorithmic fragments. Formally, also the power-type products and
procedure liftings of Universe constitute such a small set of second-order functions
that systematizes parallel access patterns for indexable container types. But in
contrast to “plain” skeleton concepts, Universe encodes the knowledge about
the problem geometry also, if not primarily, into index domains and shapes
of subdomains and operands (perhaps a sort of “geometry skeletons”). The free
combination and interaction of these two concepts makes an implementation of
Universe a demanding task and weakens its simplicity as a skeleton concept.

More technical approaches. There are numerous approaches whose philosophy
differs from the author’s in that they consist in implementation directives
for some abstract machine, as opposed to expressing structural information
about the applications in the semantics. To this class belong data
partitioning/distribution algebras, languages, and systems, also High-Performance
Fortran [7]. Another approach, a template concept for the modeling of irregular
spatial structures, is given in [4].

8  Summary  and  Conclusion 

We have mentioned the structured-universe approach, a container-type concept
based on structured infinite index domains. We have mentioned the known fact
that groups as index domains are general enough to host grids as well as trees,
and to formalize their different geometries under a unified scheme. We have
exploited this generality of groups as index domains further and have introduced a
new kind of groups to host multi-grid algorithms. These groups reflect the
multi-level nature in that grid-like and tree-like neighbourhoods interact in the same
index domain. We have related this phenomenon to commutativity properties.

This result sheds some more light on the little recognized versatility of (possibly
non-Abelian) groups as spatial domains. Originally conceived as a unification
of two different kinds of spatial structure, they generalize further to an
“interpolation” between these two. Together with the “structured-universe” approach
(an abstraction scheme reminiscent of infinite-dimensional vector spaces over
geometrically structured index domains), this new kind of index domains
provides an expressive formalization framework for adaptive multi-grid algorithms.
Such formalizations are a prerequisite for the high-level programming of
distributed-memory machines by compact programs, and may constitute the input
for an efficient automatic mapping onto such machines.
Abstract. In this paper we will present the main basis of paradeis
which extends the features of the STL library to deal with data-parallelism
and sparse matrix computation.
Topics: Languages and Tools, Numerical methods

1  Introduction 

Sparse matrix computation is recognized to be ubiquitous in computational
science but the parallel programs are error-prone, hard to debug,
and difficult to maintain. Consequently, the simplification of these parallel
programs represents an emerging trend.

In this context, several issues can be considered. The first issue puts
the emphasis on a compilation-based approach. A sparse compiler
automatically restructures a dense program dealing with arrays into a sparse
program dealing with sparse arrays [3], [5].

The  second  issue  is  to  consider  a  run-time  support  based  on  a  numerical 
library  which  offers  a  set  of  numerical  programs  in  order  to  cover  the 
main  features  of  sparse  matrix  algorithms. 

At the cross-roads of these two issues, we propose a run-time support which
provides a set of basic operations to deal with sparse matrix structure.
The goal is to define an abstract data structure which eases the parallel
programming of sparse matrices. It aims at extending the STL library
according to this scope. One of the goals of paradeis is to be used as a
user-library as well as a back-end of a parallel sparse compiler. The extensions
are mainly focused on two topics: data-parallelism and sparsity
management.

The extended abstract is organized as follows: Section 2 describes
the main features of paradeis. Section 3 briefly outlines the description
of the parallel sparse structure. Section 4 shows experiments on the
efficiency of running paradeis programs. Finally, Section 5 concludes
this paper with a discussion.

2  STL  Extension 

The Standard Template Library provides a set of well structured generic
C++ components that work together in a seamless way. The library contains
three main components: containers, which manage sets of memory
locations; iterators, which provide a means for an algorithm to traverse
a container; and algorithms, which are computational procedures.
The extensions concern the first two components, namely containers and
iterators. They cover the parallelism and the sparsity management. In
addition, we define a collective communication primitive.


2.1  Sparse  Computation  Extension 

This section concerns the extensions for sparse computation. They
aim at hiding sparsity from users in order to provide a simple
programming framework. Sparsity management is handled by methods which
operate on a dedicated container named SparseArray. Throughout this
paper, we exemplify the features with the vector addition
R = V1 + V2, where the vectors V1 and V2 are sparse and R is dense.
The SparseArray container provides a homogeneous framework for
dense and sparse arrays without loss of performance for dense array
computation. So, the declaration will be
SparseArray V1(100), V2(100), R(100);

Numerous storage formats have been proposed in the sparse-matrix
literature. For our work, we have generalized the Block storage format
to a Distributed Block storage format. This representation is hidden
from programmers, so the program is generic: the internal storage format
can be changed without any modification of the user program.


Iterators  Iterators are a generalization of pointers that allow a
programmer to work with different data structures in a uniform manner.
The iterators are divided into several classes. Each class corresponds
to the capabilities of an iterator. The class of forward iterators scans
the whole structure. In paradeis, the sparse forward iterator scans only
the non-zero values. A general scheme of an assignment restricted to the
entries structure follows a dense programming style:

Iterator i(R);
for (i.begin(); i.end(); i.next())
    R[i] = ...;

Another mechanism is added to complete the requirements of sparse
computation. This mechanism synchronizes the iterators with a selected
entries structure of a sparse data structure. For example, consider the
addition of two sparse vectors R = V1 + V2: the ith component of R is
the sum of the ith components of V1 and V2. In the sparse
representation, however, the ith components of the two vectors are not
located at the same place, and one of them may not exist at all. The
synchronized iterators behave as in the dense vector addition. Assuming
that the entries structure of R is properly declared as the union of the
entries structures of V1 and V2, the iterators scanning V1 and V2 are
synchronized with the space of the entries of R. A comparison is
performed to determine whether the values currently pointed to in V1 and
V2 represent the same entry. If so, the addition is performed. If one of
the entries is missing, the corresponding value is 0. Except for the
declaration of the iterators, the management of these rules is hidden
from the programmer, and the program stays close to the dense program:

Iterator i(R), i1(V1,i), i2(V2,i);
for (i.begin(), i1.begin(), i2.begin(); i.end();
     i.next(), i1.next(), i2.next())
    R[i] = V1[i1] + V2[i2];
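
The synchronization rule can be illustrated outside paradeis with plain
standard C++. The sketch below is not the library's implementation: the
SparseVec type (index/value pairs sorted by index, indices unique and
smaller than n) and the function add_synchronized are assumptions made
for illustration. Two cursors advance in step with the dense index, and
a missing entry contributes 0.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sparse vector: (index, value) pairs sorted by index.
using SparseVec = std::vector<std::pair<std::size_t, double>>;

// Dense result R = V1 + V2. The cursors c1 and c2 play the role of the
// iterators synchronized with the dense index i of R.
std::vector<double> add_synchronized(const SparseVec& v1,
                                     const SparseVec& v2,
                                     std::size_t n) {
    std::vector<double> r(n, 0.0);
    std::size_t c1 = 0, c2 = 0;  // cursors into v1 and v2
    for (std::size_t i = 0; i < n; ++i) {
        double a = 0.0, b = 0.0;  // a missing entry counts as 0
        if (c1 < v1.size() && v1[c1].first == i) a = v1[c1++].second;
        if (c2 < v2.size() && v2[c2].first == i) b = v2[c2++].second;
        r[i] = a + b;
    }
    return r;
}
```

A cursor moves only when its current entry matches the index being
written, which is the comparison described above.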

Entries Set Operations  In conjunction with synchronized iterators, a
set of primitives is provided to operate on the structure rather than on
the values. They make it possible to determine symbolically the shape
of the expected structure at the end of the numerical computation.

It has been shown in [1] that a correspondence can be established
between arithmetic operators and set/logical operators. For the vector
addition, the sparse structure of the entries of R is defined as the
union of the entries of V1 and V2. This structure is then used as a
reference pattern for the synchronization of the iterators.

SparseArray R(union(V1, V2)); // R is assumed sparse here
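
As an illustration of this symbolic step, the union of two entries
structures can be computed with the standard algorithm std::set_union.
The Pattern type (a sorted list of non-zero positions) and the function
pattern_union are hypothetical names used only in this sketch; the
internal representation used by paradeis is not shown in this paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical entries structure: sorted positions of the non-zeros.
using Pattern = std::vector<std::size_t>;

// Symbolic step for R = V1 + V2: no values are touched, only positions.
Pattern pattern_union(const Pattern& a, const Pattern& b) {
    Pattern out;
    out.reserve(a.size() + b.size());
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}
```

Both inputs must be sorted, which std::set_union requires; positions
present in both operands appear once in the result.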

More complex algorithms may be needed to determine the structure of the
entries, such as the symbolic factorization in sparse Cholesky
factorization. These algorithms can nevertheless be expressed with set
and logical operations. In this context, paradeis provides basic
functions to compute symbolically the fill-in introduced during the
computation.

2.2  Data-parallel  Extension 

In this section, we describe the extension of iterators for parallel
sparse-matrix computation. Before explaining its semantics, we introduce
the execution context. Programs are written in an SPMD style.
The target architecture is a distributed-memory machine or a cluster
of workstations. The sparse matrix is distributed over the processors,
and on each processor the computation is applied to the local part of
the data.

Sparse Parallel Iterators  The parallel execution is expressed by
an iterator named DoAllIterator. It guarantees that every element of
the sparse matrix is scanned, but in an unspecified order. Its
semantics is derived from the standard conditions of parallelization,
which aim at relaxing the constraints on the sequential execution order.
In practice, the iterator scans only the local part of the data on each
processor. The conversion from a global to a local address is handled by
a method of the iterator. Assuming that the vectors are appropriately
distributed, the parallel sparse program for vector addition is:

SparseArray V1(100), V2(100), R(100); // R is dense
// Initialization of V1 and V2
DoAllIterator i(R), i1(V1,i), i2(V2,i);
for (i.begin(), i1.begin(), i2.begin(); i.end();
     i.next(), i1.next(), i2.next())
    R[i] = V1[i1] + V2[i2];

Communication Primitives  Communication primitives are necessary to
exchange values between processors. Since paradeis is based on a
run-time system, no compiler automatically transforms accesses to
global memory into communications, so communications are explicit.

More details about the communication scheme are given in [4], which
focuses on the communication support for the exchange.

The design of the program follows the rules of the BSP model [6]: the
communication phases are separated from the execution phases.

However, in order to simplify the programmer's task, the communication
primitives correspond to a global communication. The program is thus the
same for the sender and the receiver, and communications are defined
with respect to a global address space. To some extent, this corresponds
to an alignment of values.

For instance, given the following assignment in Fortran 90,
X(1:100) = Y(2:101), if we assume that the arrays X and Y have the
same distribution, a communication must be performed. In paradeis, this
communication is defined by the exchange primitive as follows:

Y.exchange(X, Section(1,100), Section(2,101));

The communication primitive manages sparsity as well as distribution.
For sparse arrays, all values of Y contained in the interval 2:101 are
exchanged. The exchange is “aligned” with the sparse structure of X.
The exchange primitive also broadcasts values. For instance, given the
Fortran 90 statement X(1:100, 1:100) = Y(1:100), the communication
is expressed as follows:

Y.exchange(X, Block(Section(1,100), Section(1,100)),
           Block(Section(1,100), EXTEND));

The EXTEND keyword specifies that Y must be extended in dimension
before the exchange is performed. The extension in dimension is virtual
and does not waste memory space.

The exchange is implemented with an inspector-executor scheme. The
vector addition with an inappropriate distribution then becomes:

SparseArray V1(100), V2(100), R(100); // R is dense
DoAllIterator i(R), i1(V1,i), i2(V2,i);
V1.exchange(R, Section(1,100), Section(1,100));
V2.exchange(R, Section(1,100), Section(1,100));
for (i.begin(), i1.begin(), i2.begin(); i.end();
     i.next(), i1.next(), i2.next())
    R[i] = V1[i1] + V2[i2];

Distribution  The distribution that we consider is called a user-defined
partitioning, since it lets users define their own distribution. Given
a two-dimensional array A, the distribution corresponds to a set of
3-tuples (i, j, p), each meaning that A(i, j) is non-zero and is
assigned to processor p. This description serves as an interface between
the distribution and its internal representation.

3 Descriptor

In paradeis, the description of the distribution is based on a
conservative approximation, because the full information usually cannot
be replicated on every processor: it is too large to be held in a
processor's memory. The approximation therefore reduces the amount of
information. An approximation is valid if, for any non-zero value
located at coordinates (i, j) and distributed to processor p, the
approximation also yields this information. This description guarantees
that a communication requires only one message [4]: no additional
message is necessary to find the location of a data item. A sparse tree
descriptor describes this approximation. In this scheme, the mapping is
described by a set of 3-tuples ([l_x : u_x], [l_y : u_y], p), each
signifying that the data items contained in this block are mapped to
processor p. Logically, the descriptor can be viewed as a tree. The
descriptor shares some of its motivation (the approximation scheme, for
example) with the R-tree data structure frequently used in geographical
databases.
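
To make the role of the 3-tuples concrete, the following sketch looks up
the owner of a coordinate in a flat list of blocks. It is only an
illustration under stated assumptions: the Block structure, the function
owner, the inclusive bounds and the linear scan are hypothetical, and
the blocks are assumed not to overlap. The actual descriptor is
organized as a tree.

```cpp
#include <cstddef>
#include <vector>

// One 3-tuple ([lx:ux], [ly:uy], p) of the global level, inclusive bounds.
struct Block {
    std::size_t lx, ux, ly, uy;
    int proc;
};

// Returns the processor whose block covers (i, j), or -1 if none does.
// The approximation is conservative: a hit means that (i, j) may be
// stored on that processor; only its local level knows whether the
// entry is really non-zero.
int owner(const std::vector<Block>& blocks, std::size_t i, std::size_t j) {
    for (const Block& b : blocks)
        if (b.lx <= i && i <= b.ux && b.ly <= j && j <= b.uy)
            return b.proc;
    return -1;
}
```

Since every processor holds this global level, the destination of a
value can be found locally, without an extra message to locate the data.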


Fig.  1.  Example  of  a  descriptor  codification 


- The root is the size of the matrix.

- The first level is the common global knowledge shared by every
processor. It provides an approximation of the exact mapping.

- The second level corresponds to local knowledge. It is the exact
description of the values held by a processor. This information is
stored only on the processor to which the corresponding data are mapped.

- The leaves contain data if they are resident on the processor.

Figure 1 describes an example of this codification for a 10 x 10 matrix.
Filled cells correspond to significant values whereas empty cells
correspond to zeros.

Any partitioning can be applied provided it is described by a sparse
array descriptor. For matrices, some partitioning algorithms such as the
BRD decomposition [2] provide a natural description of the partition as
a sparse tree descriptor. If the result cannot be directly codified by
blocks, it is decomposed into several independent blocks. These blocks
are not necessarily contiguous, but they still refer to a single
partition. Once the mapping is defined, the pieces of the sparse matrix
are assigned to processors. The codification is performed in parallel:
the part of the global view computed by one processor is broadcast to
every processor.


4  Experiments 

In this section, we present experimental results on paradeis. Execution
times are measured on the matrix-vector product, which is the core of
numerous numerical linear algebra algorithms. In this program, the
matrix is sparse and the two vectors are dense. Experiments were run
on an 8-node IBM SP2 (Power PC 66.6 MHz, 256 Mb/processor). The
tests were run on several matrices from the Harwell-Boeing collection.
Only the most interesting results are presented here.

These experiments on paradeis are meant to evaluate the overhead in
computation time and in memory occupancy. The run-time overhead is
divided into two parts: a constant time on each processor and a varying
time that depends on the number of processors. The varying time reflects
the penalty of managing the parallel execution. The scalability
experiment gives the execution time on multiple processors. It allows us
to compute the ratio of computing time to idle time, thus giving a
measure of the varying time. The constant time comes from the software
layer managing the iterators. A comparison with an equivalent sequential
program (here the CRS matrix-vector product) gives a measure of this
overhead. The memory overhead is compared with a sparse data storage
(CRS) to show the impact of the parallel description and of the density
ratio. This ratio is used to build the leaves of the data structure: it
gives the number of zeros that may be stored together with non-zeros.
Reducing this ratio stores more zeros but reduces the descriptor size,
and vice versa.
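
For reference, the sequential baseline mentioned above can be written in
a few lines. The sketch below shows a standard CRS (compressed row
storage) matrix-vector product; it is not the code used in the
experiments. The CrsMatrix structure and its field names are
assumptions, and row_ptr is assumed to hold at least one element.

```cpp
#include <cstddef>
#include <vector>

// Compressed Row Storage: row r owns entries row_ptr[r] .. row_ptr[r+1]-1.
struct CrsMatrix {
    std::vector<std::size_t> row_ptr;  // size n_rows + 1
    std::vector<std::size_t> col_idx;  // column of each stored value
    std::vector<double> val;           // stored (non-zero) values
};

// y = A * x with A sparse (CRS) and x, y dense.
std::vector<double> crs_matvec(const CrsMatrix& a,
                               const std::vector<double>& x) {
    const std::size_t n_rows = a.row_ptr.size() - 1;
    std::vector<double> y(n_rows, 0.0);
    for (std::size_t r = 0; r < n_rows; ++r)
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k)
            y[r] += a.val[k] * x[a.col_idx[k]];
    return y;
}
```

The inner loop accesses the arrays directly, without any iterator layer,
which is why such a program is a natural lower bound for the overhead
measurement.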

The main properties of some of the N x N tested matrices are shown
in Figure 2: NZ, the number of non-zeros; N, the size of the matrix;
and its density (how sparse it is).


          bcsstk13   cavity09   mcfe     bcsstm13   bcsstk19
NZ        42943      32747      24382    11973      3835
N         2003       1182       765      2003       817
Density   1.07 %     2.34 %     4.16 %   0.3 %      0.57 %

Fig. 2. Properties of the tested matrices



VECPAR'2000 - 4th International Meeting on Vector and Parallel Processing


Scalability  Figure 3 shows the scalability and the corresponding
efficiency of the matrix-vector product with paradeis. The minimum
efficiency is 75 %, which shows that the management of the parallel
descriptor has little impact on execution time. The loss of scalability
is caused mainly by communications.




Fig.  3.  Scalability  and  efficiency  of  the  matrix  vector  product 


Language Overhead  To measure the time overhead induced by our iterator
access and management, we compare the execution time of the
matrix-vector product written in paradeis with that of the CRS storage
version. Depending on the matrix, the ratio r between the execution
times is in the range 1.9 < r < 3 on the RS6000 processor, in favor of
the CRS program. The loss compared with this specialized program comes
from additional bound checking.

Memory Overhead  The memory occupancy depends on two parameters: the
chosen density ratio and the matrix structure. Figure 4 presents two
measures showing these two aspects of memory usage in paradeis.
The first one, on the left, shows how the memory occupancy in bytes
evolves with the density ratio. It is compared with a distributed
version of the CRS storage (MRD, Multiple Recursive Decomposition). This
first measure shows that the memory occupancy is always close to that of
MRD and that its evolution with the density ratio depends on the matrix
structure. The second measure, on the right, gives a detailed breakdown
of memory occupancy between the descriptor and the data. It shows that
the global descriptor part is negligible and that the local part depends
on the matrix structure, growing faster when the values are more
scattered.

This experiment shows that paradeis can be compared with a dedicated
sparse matrix-vector program and that the SparseArray data structure
has a moderate memory occupancy while providing powerful distribution
and communication schemes. It is thus a trade-off between expressiveness
and efficiency for sparse linear algebra programs.


Fig. 4. Memory occupancy


5 Discussion and Conclusion

We consider paradeis as a compiler back-end as well as a user library.
Thus, it provides some features to interact with the higher levels of a
sparse and/or data-parallel compiler. It addresses a class of sparse
linear algebra problems with user-defined partitioning.

It is based on run-time support that offers a scalable and portable
framework for data-parallel programs operating on sparse matrices.
Portability comes from the implementation choices: paradeis is written
in C++ (9000 lines), a language available on a large number of
platforms, and the communication library is PVM. Scalability relies on
the primitives: the conversion from global to local addresses is handled
by methods, and the collective exchange is based on a global address
space. This leads to a scalable framework, since the user (or the
compiler) does not have to determine the exact content of a message
according to the amount of data held by a processor. The mapping phase
and the inspector-executor phase are kept distinct. We think this offers
modularity in development, since both components can be improved or
changed independently.


Abstract. In this paper we study the design of installation routines
for linear algebra routines on networks of processors. The main idea
is to develop these installation routines in such a way that they allow
non-expert users to execute the parallel linear algebra routines with an
optimum number of processors and distribution of data. The design
methodology of these routines has been analysed for homogeneous and
heterogeneous networks, and the experimental results obtained with a
Gaussian elimination routine are shown.


1  Introduction 

In recent years parallel/distributed computing has become widely popular, due
in part to the possibility of using processors connected by a communication
network as a parallel system. In that way the use of parallel systems has become
cheaper and easier, and new users are beginning to use parallel programming. In
particular, users with great computational needs (scientists and engineers)
now have the possibility of solving their problems using the machines they have
access to, connecting them by means of some communication network (Ethernet,
Fast Ethernet, Myrinet, ...) and using message-passing parallelism, without high
additional cost. But the design of message-passing parallel programs is not an
easy task, especially for non-expert users. The problems these users need to solve
are in many cases linear algebra problems: solution of linear systems, or
eigenproblems. This is why we are working on the design of linear algebra routines
especially for LANs (Local Area Networks) [1].

There is other research in which the design of linear algebra routines for
heterogeneous networks of processors is analysed [2-4]. In some cases the
distribution of data in the system is obtained dynamically [2], and statically in
others [3]. Our goal is to statically obtain data distributions close to the optimum.

* Partially supported by Comision Interministerial de Ciencia y Tecnologia, project
TIC96-1062-C03-02.

Since the users we are thinking of are not experts in parallel programming, one
possibility is to develop installation routines for the linear algebra routines, so
that the user simply calls a linear algebra routine on one processor. This
routine consults information generated by the installation routine and decides
the number of processors to use and the best data distribution to obtain the
lowest execution time.

Each linear algebra routine has an associated installation routine which
obtains approximately optimum values of the number of processors and of the block
size into which the matrices are divided. The installation routines are executed
during the installation of the linear algebra library, but they can be re-executed
each time the conditions of the system change, e.g. more memory in some machines,
modifications in the management of the file system, or the addition or removal of
a processor.

As  the  system  can  be  formed  by  processors  with  different  capacities,  it  may  be 
preferable  to  develop  linear  algebra  routines  and  the  corresponding  installation 
routines  for  heterogeneous  systems. 

In this paper we analyse the methodology for the design of these
installation routines for homogeneous and heterogeneous LANs. Experimental results
obtained with a Gaussian elimination routine and with variations in the
heterogeneity of the network are shown.

2  Installation  routines 

A  gaussian  elimination  routine  has  been  used  to  study  the  methodology  of  the 
design  of  installation  routines  and  the  behaviour  of  these  routines  in  systems 
with  different  characteristics. 

The matrix is considered as divided into blocks of adjacent rows, and the blocks
are assigned to the processors using a rowwise block-cyclic-striped mapping [5].

If the system is homogeneous the blocks can all be the same size (b). When the
block size increases the imbalance increases, but the number of communications
decreases. Thus, for a given matrix size (n), the installation routine must obtain
the number of processors (p) and the block size with which the lowest execution
time is obtained.
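
The trade-off can be seen on a small sketch of the mapping. The code
below assumes the usual reading of a rowwise block-cyclic mapping, in
which block k (rows k*b to k*b + b - 1) goes to processor k mod p, with
rows and processors numbered from 0. The function names row_owner and
rows_on are hypothetical.

```cpp
#include <cstddef>

// Owner of a matrix row: rows are grouped into blocks of b adjacent
// rows, and block k is assigned to processor k mod p.
std::size_t row_owner(std::size_t row, std::size_t b, std::size_t p) {
    return (row / b) % p;
}

// Number of rows of an n-row matrix that end up on processor q.
std::size_t rows_on(std::size_t q, std::size_t n, std::size_t b,
                    std::size_t p) {
    std::size_t count = 0;
    for (std::size_t row = 0; row < n; ++row)
        if (row_owner(row, b, p) == q) ++count;
    return count;
}
```

For example, with n = 10 and p = 3, a block size of 1 gives 4, 3 and 3
rows to the three processors, whereas a block size of 2 gives 4, 4 and
2: the larger block size is less balanced but groups the rows into
fewer blocks.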

In the case of a heterogeneous system, the block size is not the same for
the different processors: the blocks assigned to processors with higher
computational capacity are larger. The installation routine also obtains the
optimum number of processors and block size, but in the linear algebra routine
the size of the blocks in each processor is obtained by the formula:

$$ b_i = \frac{v_i}{\sum_{j=1}^{p} v_j}\, p\, b \qquad (1) $$

where v_i is the speed of processor i (e.g. in Mflops) and b_i the size of the blocks
assigned to processor i. This is the way in which other software for heterogeneous
computing works [2] when dynamically deciding the data distribution in the
system. In our case the assignment is not done dynamically, since dynamically
obtaining the optimum number of processors and block size would be too costly.
What could be done in our case is to obtain the values of v_i dynamically by
executing some basic matrix operation (e.g. a matrix multiplication) on each
processor, but we have preferred to leave this work to the installation routine
because we consider that LAN systems are normally well controlled. In any case, the
modification needed to obtain the block sizes b_i dynamically is only a small one,
and it could be made in the linear algebra routine.
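
A minimal sketch of this computation follows, reading formula (1) as
b_i = v_i / (v_1 + ... + v_p) * p * b, so that equal speeds give
b_i = b. The function name block_sizes is hypothetical, and the
rounding to a whole number of rows with a lower bound of 1 is an
assumption of the sketch, since the text does not say how fractional
sizes are handled.

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Block size of each processor from its speed v_i and the reference
// block size b. Rounding and the minimum of 1 row are assumptions.
std::vector<std::size_t> block_sizes(const std::vector<double>& speeds,
                                     std::size_t b) {
    const double total = std::accumulate(speeds.begin(), speeds.end(), 0.0);
    const double p = static_cast<double>(speeds.size());
    std::vector<std::size_t> sizes;
    sizes.reserve(speeds.size());
    for (double v : speeds) {
        const double exact = v / total * p * static_cast<double>(b);
        const long rounded = std::lround(exact);
        sizes.push_back(rounded < 1 ? 1 : static_cast<std::size_t>(rounded));
    }
    return sizes;
}
```

With relative speeds 1, 1 and 2 and b = 8, the three processors receive
blocks of 6, 6 and 12 rows; the sum of the block sizes stays equal to
p * b.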

The installation routine could work by obtaining the values of p and b from
a formula for the theoretical execution time (which depends on the linear algebra
routine) and from the cost of an arithmetic operation, the start-up time and the
word-sending time, obtained experimentally for the system. Some
problems remain: to obtain the theoretical execution time we normally make
some assumptions which are valid when the matrix size increases, but not for
small matrices, and it is more difficult to predict the experimental results in
LANs than in multicomputers, due to the characteristics of the communication
network.

Another possibility is to obtain the optimum number of processors and block
size by performing a number of executions. In that case the system manager
decides the minimum and maximum matrix sizes and the increment of the matrix
size. Experiments are performed for these matrix sizes, but performing the
experiments for all possible numbers of processors and block sizes could be too
expensive. What the routine does instead is to obtain the number of processors and
block size for the smallest matrix, and then to use the values obtained for one
matrix size as initial values for the next matrix size. Since the optimum values of
p and b vary in a continuous manner with the matrix size, the cost of the
installation routine thus becomes acceptable.
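
The warm-start idea can be sketched as follows. The text does not
specify how the search around the initial values is carried out, so the
greedy neighbourhood search used here is an assumption, as are the names
Config, Timer, refine and install. The Timer callback stands for one
timed execution of the linear algebra routine.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct Config { int p; int b; };  // number of processors, block size

// Hypothetical timing hook: runs the routine for size n, returns seconds.
using Timer = std::function<double(int n, Config c)>;

// Greedy search: move to a neighbouring (p, b) while it is faster.
Config refine(int n, Config start, int max_p, int max_b, const Timer& time) {
    Config best = start;
    double best_t = time(n, best);
    bool improved = true;
    while (improved) {
        improved = false;
        const Config around[] = {{best.p - 1, best.b}, {best.p + 1, best.b},
                                 {best.p, best.b - 1}, {best.p, best.b + 1}};
        for (Config c : around) {
            if (c.p < 1 || c.p > max_p || c.b < 1 || c.b > max_b) continue;
            const double t = time(n, c);
            if (t < best_t) { best = c; best_t = t; improved = true; }
        }
    }
    return best;
}

// One table entry per matrix size; each search starts from the values
// found for the previous (smaller) size.
std::vector<Config> install(const std::vector<int>& sizes, Config first,
                            int max_p, int max_b, const Timer& time) {
    std::vector<Config> table;
    Config current = first;
    for (int n : sizes) {
        current = refine(n, current, max_p, max_b, time);
        table.push_back(current);
    }
    return table;
}
```

Because consecutive optima are close to each other, each call to refine
is expected to need only a few timed executions, instead of one per
(p, b) pair.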

In  the  homogeneous  case,  a  file  with  the  matrix  sizes  and  the  associated 
number  of  processors  and  block  size  is  generated,  and  the  linear  algebra  routine 
consults  this  file  for  the  entry  with  the  matrix  size  nearest  to  the  actual  matrix 
size  in  order  to  decide  the  number  of  processors  and  the  distribution  of  the 
matrix  in  the  execution. 
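
The consultation step amounts to a nearest-neighbour lookup on the
matrix size. In the sketch below the file is assumed to have been read
into a non-empty vector; the Entry structure and the function
nearest_entry are hypothetical names, and ties are resolved in favour of
the earlier entry.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// One line of the installation file: matrix size, processors, block size.
struct Entry { int n; int p; int b; };

// Entry whose matrix size is nearest to the actual size n.
// Assumes the table is non-empty.
Entry nearest_entry(const std::vector<Entry>& table, int n) {
    std::size_t best = 0;
    for (std::size_t k = 1; k < table.size(); ++k)
        if (std::abs(table[k].n - n) < std::abs(table[best].n - n))
            best = k;
    return table[best];
}
```

For a table with entries for sizes 500, 1000 and 1500, a matrix of size
1100 would use the values stored for 1000.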

In the heterogeneous case, that file is also generated along with an additional
file with the proportional speeds of the processors, classified from fastest to
slowest. The linear algebra routine takes the number of processors from the first
file, the processors to use in the execution from the second file, and the block
sizes are obtained using formula (1) with the value b obtained from the first file
and the values v_i from the second file.


3  Experimental  results 

In this section the experimental results obtained using different installation
routines for a Gaussian elimination routine are shown. The entries of the matrices
have been randomly generated, and a network of SUN Ultra workstations
connected by Ethernet has been used. Three networks are considered.

- A network of five SUN Ultra 1 with the same computational capacity, but
with one of them managing the file system, and consequently this machine
works more slowly. This is the most homogeneous system of the three, and
it will be called HOM.

- A network obtained by adding to HOM a SUN Ultra 5, which is quicker than
the other processors. This network will be called HIB.

- A network with three processors: the SUN Ultra 1 which manages the file
system, another SUN Ultra 1, and the SUN Ultra 5. Since this network can
be considered the most heterogeneous, it will be called HET.

In figures 1, 2 and 3 the results obtained with the different installation
routines are compared with those obtained experimentally. The figures show the
quotients between the execution time obtained with the number of processors
and block size given by the different installation routines and the optimum
execution time obtained experimentally. Figure 1 shows the quotient for HOM, figure 2
for HIB, and figure 3 for HET. The installation routines used are: HOMEXP, the
homogeneous-experimental routine; HETEXP, the heterogeneous-experimental routine;
and THEOR, the theoretical routine. THEOR is used in HOM to obtain the
theoretical optimum number of processors and block size to use in each
processor; in HIB and HET the same number of processors is used, but the block
sizes differ between processors.




Fig.  1.  Comparison  between  the  different  installation  routines  in  HOM.  Quotient  of 
the  execution  time  obtained  with  the  values  of  p  and  b  provided  by  the  installation 
routines  and  the  best  execution  time. 


Some considerations can be made:

- The theoretical routine does not predict the number of processors and the
matrix distribution well for small matrices, but the prediction improves as the
matrix size increases. In some cases (HET) the theoretical prediction is as good
as the experimental prediction. Theoretical prediction could be used
for big matrix sizes, since in that case the experimental routines are too
expensive.

- The homogeneous-experimental routine works well when the network is close
to homogeneous, but the heterogeneous-experimental routine is
preferable when the heterogeneity increases.

- The heterogeneous-experimental routine works well in all the systems, so this
type of routine is the best choice as installation routine.


Fig. 2. Comparison between the different installation routines in HIB. Quotient of the
execution time obtained with the values of p and b provided by the installation routines
and the best execution time.

Fig. 3. Comparison between the different installation routines in HET. Quotient of the
execution time obtained with the values of p and b provided by the installation routines
and the best execution time.


4  Conclusions 

We have shown a methodology for designing installation routines for linear algebra
libraries aimed at non-parallel programmers in LANs. These routines can be used to
provide the users with the number of processors and the block sizes for solving
the problems in a time close to the optimum.

The experiments performed show satisfactory results.

Our idea is to design a linear algebra library using installation routines of
the type analysed in this paper.


Abstract. The task considered is building a cellular automaton such
that an array of automata of this type with an arbitrary unilateral
bivalent connection graph can solve the same problem as a bilateral linear
cellular automata array. It is presumed that the complexity of the cellular
automaton does not depend on the number of automata in the array
and, maybe, depends in some regular way on the rank of the respective
graph vertex.


1  Introduction 

First of all, we will try to give a more or less precise definition of functional
equivalence as we treat it in this paper.¹ Of course, this definition depends directly
on how we treat the function realized by the array, the problem solved by the
array and the behavior of the array. Different treatments of these terms lead to
different results and different interpretations of them. When speaking of a problem
solved by the array, we follow the classical examples from the works by Hennie [1],
Fisher [2], Myhill [3], etc. We deal with problems whose
formulation is invariant to the array size (number of automata), i.e. the
automaton complexity (number of internal states) does not depend on the array
size. Typical examples of such problems for one-dimensional arrays (chains
of automata) are the calculation of symmetrical Boolean functions [4, 5], the multi-valued
voting problem [5-8] and the firing squad synchronization problem [5, 7-15], which are
formulated as follows.²

1 Strangely, this definition turned out to be the most complicated part for us,
even after we had successfully solved the problem of building an equivalent array.

2 Solving these problems is beyond our scope. Our goal is to show by examples what
sort of problems an array can solve.

Fig. 1. Automata arrays calculating symmetrical Boolean functions (a), solving the
multi-valued voting problem (b), and solving the firing squad synchronization problem (c).

Calculation of symmetrical Boolean functions [4, 5]. There is a bilateral linear
automata array of n + 1 identical automata (Fig. 1,a). Every automaton has two
external inputs fed by the Boolean function variables $x_j$ and the variables $r_i$ determining
the working numbers of a symmetrical Boolean function (if $r_i = 1$, then the
function contains the i-th working number). The lateral inputs (outputs) are
fed by the states of the right and left neighbors (the automaton's own current
state).³ The input of the Z junction is fed by the initiation signal. After some
time the automaton $A_0$ must go to the state corresponding to the function value.
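To make the notion of working numbers concrete, the following Python sketch evaluates a symmetrical Boolean function directly, without any automata array. It assumes the usual reading that the function equals 1 exactly when the number of variables equal to 1 is a working number. The function name and the convention that `r` holds one flag per possible count of ones are illustrative assumptions, not taken from the paper.

```python
def symmetric_boolean(x, r):
    """Evaluate a symmetrical Boolean function given by working-number flags.

    x: list of 0/1 variable values.
    r: list of 0/1 flags; r[i] == 1 means that i is a working number.
    Assumption of this sketch: r has len(x) + 1 entries, one per possible
    count of ones (0 .. len(x)).
    """
    if len(r) != len(x) + 1:
        raise ValueError("r must have one flag for each count 0..len(x)")
    ones = sum(x)      # a symmetrical function depends only on this count
    return r[ones]


# Example: the parity of three variables has the working numbers 1 and 3.
parity_flags = [0, 1, 0, 1]
print(symmetric_boolean([1, 0, 1], parity_flags))  # two ones, so the value is 0
```

The array of automata has to obtain the same value using only local exchanges between neighbors, which is what makes the problem non-trivial.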

Multi-valued voting problem [5-8]. The problem itself is the following. There
are N k-valued variables $X_i \in \{0, 1, 2, \ldots, k-1\}$, and $m_j$ is the number of
variables that take the value j. The multi-valued voting function is

$$F(X_0, X_1, \ldots, X_{N-1}) = a \quad \text{if} \quad m_a = \max_j (m_j). \tag{1}$$

From (1) follows the problem formulation for a bilateral linear homogeneous
automata array. The external inputs of the automata (Fig. 1,b) are fed with k-valued
external variables. Some time after the initiation signal arrives at the Z junction,
the automaton $A_0$ must go to the state corresponding to the value of the multi-valued
voting function.
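As a point of reference for equation (1), the voting function can be computed centrally in a few lines of Python. This is only a sketch of the function itself, not of the automata array. The name `voting` and the tie-breaking rule (the smallest value wins) are assumptions, since the text does not say how ties are resolved.

```python
from collections import Counter


def voting(values, k):
    """Return a value a in {0, ..., k-1} whose count m_a is maximal.

    Assumption of this sketch: ties are broken in favour of the smallest value.
    """
    if any(not 0 <= v < k for v in values):
        raise ValueError("every variable must lie in {0, ..., k-1}")
    counts = Counter(values)   # counts[j] is m_j; missing values count as 0
    return max(range(k), key=lambda j: (counts[j], -j))


print(voting([2, 0, 2, 1, 2, 0], k=3))  # the value 2 occurs three times
```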

Firing squad synchronization problem [5, 7-15]. In this problem, the automata
do not have external informational inputs (Fig. 1,c). After the input Z is fed
by the external signal, all the automata have to go simultaneously to a final
state after some delay, under the condition that none of them enters this
state before the moment of common synchronization.⁴

Other problems can also be formulated. However, these three examples are
enough for the goals of our article. They cover three classes of linear arrays: arrays
with homogeneous external inputs (the enumeration of the external inputs is
not related to the automata enumeration, as in the multi-valued voting problem), arrays
with non-homogeneous external inputs (the enumeration of the variables $r_i$ is related
to the automata enumeration, as in the calculation of symmetrical Boolean functions)
and arrays without external inputs (the firing squad synchronization problem).

3 For arrays of Moore automata.

4 This problem has also been formulated and solved for the case of initiating an
arbitrary automaton in the chain [5, 7, 8, 14].

We deliberately do not consider here solving these problems in minimum
time or by automata with a minimum number of states; we are interested only
in the existence of a solution. In the same way, when discussing the functional
equivalence problem, we are interested only in whether it is possible in principle to build
an automaton for an arbitrary unilateral array from a known automaton for a linear
bilateral array that solves a similar problem.

2  Modeling  the  behavior  of  linear  bilateral  arrays  by 
unilateral  rings  [5] 

Let us consider a bilateral chain of Moore automata without external inputs.
The lateral output of automaton $A_j$ at the moment t + 1 is its internal state
$a_j(t+1)$, which depends on its own state and the states of its neighbors at the moment t:

$$a_j(t+1) = f(a_{j-1}(t), a_j(t), a_{j+1}(t)), \quad 0 \le j < n, \tag{2}$$

where $a_{-1}$ and $a_n$ are the boundary conditions. First, instead of a bilateral chain
of Moore automata let us consider a bilateral ring of Moore automata where the
boundary conditions are replaced by modulo-n adjacent indexes. Let us put in
correspondence to such a ring a unilateral ring of n automata $B_j$ (Fig. 2), where the
index j is counted mod n. Every automaton $B_j$ is a composition of two
sub-automata $U_j$ and $D_j$.

Fig. 2. Unilateral ring of automata.

Let us define the transition functions of these automata
differently at even and odd time moments:

$$d_j(2t) = u_j(2t-1), \quad u_j(2t) = u_{j-1}(2t-1). \tag{3}$$

$$d_j(2t+1) = u_j(2t), \quad u_j(2t+1) = \varphi(u_{j-1}(2t), u_j(2t), d_j(2t)). \tag{4}$$

Substituting (3) into (4), we obtain

$$u_j(2t+1) = \varphi(u_{j-2}(2t-1), u_{j-1}(2t-1), u_j(2t-1)). \tag{5}$$

Let the automaton U have the same set of states as the automaton A and
the same transition function at the time moments 2t + 1, i.e. $\varphi = f$. Then, taking into account
the shift of the B automaton states by one ring position in every two cycles of its
functioning, we can finally write:

$$u_{(j+t) \bmod n}(2t+1) = a_j(t). \tag{6}$$

Equation (6) makes precise the sense in which
the unilateral ring of automata B simulates the behavior of the bilateral ring
of automata A.⁵ Note that the sub-automaton $D_j$ is just an extra memory
register in which the state of $U_j$ is stored at even moments.
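Relation (6) can be checked numerically. The Python sketch below simulates a bilateral ring according to (2) with indexes taken mod n, and a unilateral ring according to (3) and (4) with $\varphi = f$, and compares the two after every double cycle. The local rule `f`, the state alphabet of five values, the initial condition $u_j(1) = a_j(0)$ and all function names are assumptions chosen for illustration only.

```python
import random


def step_bilateral(a, f):
    """One step of the bilateral ring, eq. (2) with indexes taken mod n."""
    n = len(a)
    return [f(a[(j - 1) % n], a[j], a[(j + 1) % n]) for j in range(n)]


def double_step_unilateral(u, f):
    """Advance the unilateral ring from moment 2t-1 to 2t+1 using (3) and (4)."""
    n = len(u)
    # Even moment 2t, eq. (3): D_j stores u_j, U_j takes the state of U_{j-1}.
    d_even = list(u)
    u_even = [u[(j - 1) % n] for j in range(n)]
    # Odd moment 2t+1, eq. (4) with phi = f; only the left neighbour is read.
    return [f(u_even[(j - 1) % n], u_even[j], d_even[j]) for j in range(n)]


def check_equation_6(n=7, steps=10, seed=0):
    """Return True if u_{(j+t) mod n}(2t+1) == a_j(t) for t = 0..steps."""
    rng = random.Random(seed)

    def f(left, centre, right):
        # Arbitrary asymmetric rule, so that a wrong shift would be detected.
        return (left + 2 * centre + 3 * right) % 5

    a = [rng.randrange(5) for _ in range(n)]
    u = list(a)                      # assumed initial condition u_j(1) = a_j(0)
    for t in range(steps + 1):
        if any(u[(j + t) % n] != a[j] for j in range(n)):
            return False
        a = step_bilateral(a, f)
        u = double_step_unilateral(u, f)
    return True


print(check_equation_6())
```

Each automaton of the unilateral ring reads only its left neighbour, yet after two cycles it holds the state that the bilateral ring would produce one position to the left, which is exactly the shift expressed by (6).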

Let us now go back to the bilateral linear array. It differs from a bilateral
automata ring only by the presence of the boundary conditions that make the
internal and edge automata different. Since in the unilateral automata ring the
initial automata enumeration shifts along the ring during the functioning,
the information about the boundary conditions must also shift along the
ring with the same speed. This can be provided in several ways, for example
by introducing into the automaton $B_j$ a sub-automaton $Y_j$ with two states
($y_{n-1} = 1$; $y_j = 0$ for $j \ne n-1$). Then equations (3) and (4) take the following form:

$$\begin{aligned}
d_j(2t) &= u_j(2t-1); \quad u_j(2t) = u_{j-1}(2t-1); \quad y_j(2t) = y_{j-1}(2t-1); \\
d_j(2t+1) &= d_j(2t); \quad y_j(2t+1) = y_j(2t); \\
u_j(2t+1) &= \varphi(u_{j-1}(2t), u_j(2t), d_j(2t), y_{j-1}(2t), y_j(2t)).
\end{aligned} \tag{7}$$

The value combinations of the variables $y_{j-1}(2t), y_j(2t)$ mean: 00, an internal
automaton; 10, the extreme left automaton; 01, the extreme right automaton. In the
problems discussed above, the extreme left automaton is fed by the initiating
signal and the same automaton generates the output signal.
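The way the boundary marker travels along the ring can be illustrated with a short Python sketch. It applies only the y-part of (7), one shift per double cycle, and decodes the pair $(y_{j-1}, y_j)$ as described above. The function name and the role labels are hypothetical.

```python
def boundary_roles(n, t):
    """Roles of the n ring automata after t double cycles (moment 2t + 1)."""
    if n < 2:
        raise ValueError("the sketch assumes at least two automata")
    y = [0] * n
    y[n - 1] = 1                        # initial marker: y_{n-1} = 1
    for _ in range(t):                  # y_j(2t) = y_{j-1}(2t-1), kept at 2t+1
        y = [y[(j - 1) % n] for j in range(n)]
    names = {(0, 0): "internal", (1, 0): "left", (0, 1): "right"}
    return [names[(y[(j - 1) % n], y[j])] for j in range(n)]


print(boundary_roles(5, 0))
print(boundary_roles(5, 2))
```

After t double cycles the extreme left role sits at position t mod n, in agreement with the index shift in (6).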

In the same way, the external variables $X_j$ are incorporated; they must also
shift along the ring together with the working indexes of the automata.
The variables are stored in registers, and $X_j(2t) = X_{j-1}(2t-1)$.

The above is enough to build, from an automaton that solves a problem on a
bilateral automata chain, a unilateral automata ring that solves the same problem,
at least for problems of the types mentioned in the introduction.

3  Modeling  the  behavior  of  a  unilateral  ring  by  an 
automata  array  with  arbitrary  connected  unilateral 
bivalent  graph  of  connection 

As in the previous section, we will treat the “possibility of modeling” as the
existence of an automaton whose number of states and transition function do
not depend on the number of vertexes or on the connection graph of the modeling
array. Besides, this automaton must depend regularly on the valence of the
vertexes and must be constructible from the automaton of the modeled ring.

In order to reduce the problem to the one we have already solved, let us first
ask ourselves whether an arbitrary unilateral graph can be re-commutated as a
ring. If it can, the problem is solved in principle. A positive answer
to this question can be based on the existence of an Euler cycle in the graph. The conditions
for the existence of an Euler cycle in the graph are the connectedness and bivalence
of the graph [16-18].

5 This result was published in 1973 in a Russian journal [5]; it is practically unknown
among specialists.

Let us associate with every vertex of the graph having k incoming and k outgoing arcs
(the bivalence property) a composition of two automata,
a functional one and a control one, and a fully accessible k x k commutator (Fig. 3).
The automaton compositions that correspond to vertexes of different degree
differ only in the commutator size. Talking about the commutator, we will

Fig. 3. Structure of the automaton corresponding to a graph vertex with k incoming
and k outgoing arcs (control automaton, functional automaton and k x k commutator).


assume that the arcs are bundles of wires connecting the vertexes (automata).
Via these wires the automaton sends the information about its state and receives
the information about the state of some other automaton that is found by the
commutation algorithm to be the successor of this automaton in the ring. Note
that only one of the input arcs is connected to the corresponding commutator input
through the functional automaton $B_k$. If the direct connection of the input arc $l_{ik}$
and the output arc $l_{kj}$ of the vertex $a_k$ is considered as an electrical wire connection,
then the procedure of such commutation⁶ is equivalent to removing the arcs $l_{ik}$
and $l_{kj}$ from the graph and creating a new arc $l_{ij}$. Hence, if as a result of
the commutation we build an Euler cycle from an arbitrary connected bivalent unilateral
graph with n vertexes, this is functionally equivalent to a unilateral ring
of n automata.

Let us consider the behavior of the composition of control automata and
commutators in solving the problem of building the Euler cycle. First, we should
rely on some algorithm that finds the Euler cycle for an
arbitrary unilateral connected bivalent graph. The simplest algorithm of this
type would seem to be the following: “when going through an arc, mark it;
when leaving a vertex, follow an unmarked arc”. However, as shown by Ore,
this algorithm leads to an Euler cycle only for a very limited subclass of
unilateral connected bivalent graphs [19]. We will use here the algorithm by
Hoang Tuy [17] as formulated by Zykov [18], but with a modification
discussed below.

6 In the case when there is no functional automaton between these arcs.

At the initial moment, all the commutators are disconnected, i.e. no input
arc is connected to any output arc. All the control automata are in the passive
state P. The initiating external signal X = 1 arrives at a certain automaton
$A_0$, bringing it to the working state $M_1$.⁷ The state (token) $M_1$ goes to the first
output arc (the enumeration is arbitrary). The control automaton $A_j$ receives the
token $M_1$ via one of its input arcs and goes to the state $M_1$, producing a signal
for the commutator. The latter commutates the input arc via which the token
has come with the first output arc, which means passing this token on to the next
vertex. This procedure lasts until the token $M_1$ appears again on one of the
input arcs of some vertex. In this situation, the automaton keeps its internal
state $M_1$, commutating the input arc via which the new signal has come with
the next free output arc. This process continues until it turns out that the
commutator does not have any free output arc. Since the graph is bivalent,⁸ this
can occur only in the initial vertex. Indeed, up to this moment we have been
following the algorithm “when coming to a vertex, go to the first non-passed
arc.” In this case, as Ore [19] showed, no arc will be passed twice; however,
in an arbitrary graph non-passed arcs and vertexes may remain. In the next
phase, we will pass some vertexes for the second time. The commutator of the
initial vertex commutates the last initiated input arc with the first output arc,
via which the first signal $M_1$ was sent, creating a cycle. The control automaton
goes to the state $M_2$, injecting the respective signal (token $M_2$) into the cycle.
The token $M_2$ is passed to the adjacent automaton. The commutator of this automaton
can be in one of the following two states:

- All the input arcs are commutated with all the output arcs. In this case, the
control automaton goes to the state $M_2$, passing the token $M_2$ on to the output.

- The commutator has at least one free output arc. In this case, the control
automaton keeps the state $M_1$, breaks the commutation of the input arc via which the
token has come and commutates this arc with the next free output arc, injecting
the token $M_1$ there (Fig. 4).


Fig. 4. Commutation when the token $M_2$ arrives via one of the input arcs of
a vertex that has at least one free output arc.


Let us assume that the token came to a vertex J via its input arc $l_{ij}$
commutated with the output arc $l_{jk}$. Breaking this connection, we send the token $M_1$
via the output arc $l_{jr}$, which had been free. The token $M_1$ gets into a certain
connected bivalent subgraph $G_J$ formed by the vertex J and some subset of its
non-passed (non-commutated) arcs. Since this subgraph is connected and bivalent,
after some time $M_1$ will come back to the vertex J via a certain arc $l_{mj}$. As
in the previous case (the subgraph associated with the initial vertex), $M_1$ will keep coming
back to the vertex J without entering already commutated paths until only
one non-commutated output arc, namely $l_{jk}$, remains in the vertex J. The input
arc via which $M_1$ has come is commutated with $l_{jk}$ and the control automaton
passes the respective token ($M_1$) to the output via $l_{jk}$. After that the marker
$M_1$ propagates only within the already commutated cycle until it comes back to the
initial vertex, where it turns into $M_2$ according to the rule described above, keeping
the internal state of the control automaton.

7 This can be either a specially allotted automaton, for example, an automaton that
models the left lateral automaton in the linear array, or an arbitrary automaton.

8 Every vertex has an equal number of input and output arcs. This number can be
different for different vertexes.

Thus, every time the markers $M_2$, $M_1$ complete a rotation, another path is added to the
initial cycle. This continues until $M_2$ comes back to the initial vertex. It means
that all the commutators on the way of $M_2$ have really been commutated, i.e.
the Euler cycle has been completely formed.⁹ Indeed, the fact that $M_2$ has come
back to the initial vertex indicates that no vertex it has passed has free
(non-commutated) output arcs (otherwise, the token $M_2$ would have been replaced by
$M_1$), i.e. all the arcs of the graph have been passed by $M_2$ and they form a cycle.
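The commutation procedure described above can be imitated sequentially. The following Python sketch is one possible reading of it, not the automaton construction itself: the dictionary `succ` plays the role of all commutators (input arc to output arc), `free_out` holds the free output arcs of each vertex, and a single loop moves the token. It assumes a connected graph in which every vertex has equally many input and output arcs. It further assumes that a token $M_1$ travelling along already commutated arcs simply follows the existing commutation and is turned back into $M_2$ only when it reaches the initial vertex through the arc that closed the first cycle. All names are illustrative.

```python
from collections import defaultdict, deque


def euler_cycle_by_commutation(arcs, start=0):
    """Return the arc indexes of an Euler cycle, in the order they are passed.

    arcs: list of (tail, head) pairs; arc i is identified by its index i.
    """
    free_out = defaultdict(deque)          # vertex -> its free output arcs
    for i, (tail, _) in enumerate(arcs):
        free_out[tail].append(i)
    succ = {}                              # commutators: input arc -> output arc

    def head(arc):
        return arcs[arc][1]

    def walk_m1(arc, home, last_out):
        """Token M1: follow free arcs until stuck (only possible at 'home'),
        then commutate the last input arc with the reserved arc 'last_out'."""
        while free_out[head(arc)]:
            nxt = free_out[head(arc)].popleft()
            succ[arc] = nxt
            arc = nxt
        assert head(arc) == home, "the graph is not bivalent"
        succ[arc] = last_out
        return arc

    # Phase 1: M1 leaves the initial vertex via its first output arc.
    first = free_out[start].popleft()
    closing = walk_m1(first, start, first)

    # Phase 2: M2 rotates along the cycle. At a vertex with a free output arc
    # the commutation is broken and M1 makes a detour through free arcs.
    while True:
        token, arc = "M2", first
        while arc != closing:
            v = head(arc)
            if token == "M2" and free_out[v]:
                old_out = succ[arc]
                detour = free_out[v].popleft()
                succ[arc] = detour
                walk_m1(detour, v, old_out)
                token, arc = "M1", old_out
            else:
                arc = succ[arc]
        if token == "M2":                  # a full rotation without changes
            break

    order, arc = [first], succ[first]
    while arc != first:
        order.append(arc)
        arc = succ[arc]
    return order


example = [(0, 1), (1, 2), (2, 0), (1, 3), (3, 1)]
print(euler_cycle_by_commutation(example))  # [0, 3, 4, 1, 2]
```

In the example the first phase closes the cycle 0 → 1 → 2 → 0 and leaves the loop through vertex 3 unpassed; the first rotation of $M_2$ finds the free output arc at vertex 1 and splices the loop in, and the second rotation completes without changes.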

The appearance of $M_2$ at the input of the initial vertex switches the control
automaton to the working state W and passes the control to the functional
automaton, which prior to this time has only passed markers on to its output.
If a signal belonging to the set of functional automaton states appears at the
input of any vertex other than the initial one, the control automaton
of this vertex goes to the state W and the control is passed to the functional
automaton.

4  Conclusion 

It follows from the above that, given an automaton for a bilateral chain,
there is a way to construct an automata composition that solves the same
problem on an arbitrary unilateral bivalent graph. Doing so, we keep the basic
property of the automata designed for this type of problem: the automaton
complexity (number of internal states and transition functions) in every vertex
of the graph does not depend on the number of vertexes or the type of the graph.
We do not claim that the suggested solution is optimal, either in
terms of the algorithm complexity or in terms of the time needed for solving
the problems. Our goal was just to prove that the solution exists, because this
fact has been doubted in private talks and seminar discussions. Furthermore,
because of this goal we did not extend our discussion to the automaton
construction, limiting ourselves to the algorithm description.

On the other hand, this does not mean that every problem of those mentioned
above can be solved by the method we suggest. For example, the problem of
calculating symmetrical Boolean functions requires that the inputs be linked
to the numbers of the automata in the ring, while the suggested algorithm of
building the Euler cycle does not allow us to provide such a linkage.

9 This algorithm differs from Tuy's algorithm in that, in the latter, the initial
cycle and the added loops are enumerated by a sequence of numbers, whereas in our case
there is no enumeration: it is replaced by the switching of two markers.

Finally, let us note that it would be interesting to consider a parallel
algorithm for building the Euler cycle, in which the initiating token is injected into
the graph via all the output arcs of the initial vertex.


Abstract - In this paper we describe a software package corresponding to a
DMS - Distribution Management System - that is structured in terms of a
distributed multitask client-server architecture. The system is implemented
using Object Oriented technology and integrates a number of power system
application tools that are structured in terms of main coordinator objects
calling and directing the object models of system components as well as other
auxiliary interface and calculation objects. These tools can be activated by
several entities corresponding to clients in terms of a distributed architecture.

1.  Problem  Positioning 

For several years electric utilities directed a major percentage of their investments to
the generation and high voltage transmission systems. This explains why
generation and high voltage transmission systems are now characterized by higher
levels of automation and better performance indices, both in terms of economic efficiency
and of reliability. This trend started to change some years ago, with
the consequence that today much more of the effort is directed to the distribution
area. These efforts have led to the development and installation of new automation,
telemetering and communication facilities at the distribution level in order to have
tools to monitor the networks, to operate systems remotely and centrally, and to
reduce the number of interruptions as well as the interruption times. The investments
in the distribution area, together with new technological advances, have made it feasible to
have in Control Centers real-time values for an increasing number of variables as
well as indications regarding the topology in operation.

However, distribution networks have some distinctive aspects when compared with
higher voltage transmission systems that prevent the direct migration of solutions and
applications common and well established in EMS - Energy Management Systems -
at the generation/transmission level. Apart from that, the size of distribution networks
makes it most probable that the investments will be spread over a large number of
years until an adequate level of automation and tele-operation is achieved. Finally, the
presence of large numbers of independent generators and the liberalization of the
electricity sector are already having non-negligible impacts on the distribution sector
as the eligibility levels for accessing the open markets start to decrease.

All these changes and challenges suggest it is crucial to develop new DMS -
Distribution Management Systems - according to the requirements of the distribution
sector at the advent of the open market. This means we would not be tied to solutions
already developed for the generation/transmission systems and integrated in EMS
systems, but would rather be concerned with developing a new system. This
motivated our team to work in this area and to present to the PRAXIS program -
the Portuguese state program for financing scientific and development projects - the
PREORD proposal in order to develop a new DMS system using new technologies
and more advanced programming facilities. In any case, the development of DMS
shares with EMS some general guidelines (see reference 6, for instance) such as flexibility,
reusability, openness (in terms of portability, interoperability, interconnectivity
and scalability), security and accessibility.

The software package described in this paper can be seen as part of the move towards an
increasing level of automation of distribution systems. Apart from that, one witnesses
an important evolution in several technological aspects regarding databases,
programming languages and hardware structures themselves, and new requirements for
a more intuitive and friendly user interface and for supporting more
complex and involved functions. In this scope, several applications were developed
in recent years adopting the Object Oriented paradigm. As examples, references 1, 3,
4, 5, 8, 10 and 11 describe several applications of Object Oriented technology to
power systems, covering general power system models,
graphical user interfaces, topology processors and power flow algorithms. According
to these references, distributed systems in general, and Object
Oriented applications in particular, are the most adequate and flexible approaches
for developing new generation DMS systems.

2.  Object  Oriented  Basic  Concepts 

Under the Object Oriented paradigm, objects are the main units in the
strategy adopted to solve a particular problem. The adoption of an Object Oriented
approach mainly aims at capturing the real-world concepts that are significant for
the application being developed. Under this paradigm, real systems are usually
structured in terms of a number of objects that can be grouped in Classes sharing a
common set of variables and methods, possibly including calculation methods that
manipulate the particular values assumed by those variables (see references 3 and 7).
The objects sharing the same information, from these two points of view, are included
in a Class so that a particular object can be seen as an instance of that Class. From this
point of view, the variables of a particular object in a Class are assigned particular values
corresponding to instances defined for those variables.

It is important to notice that the above Class definition is flexible enough to allow the
construction of a hierarchical structure in which some Sub-Classes are defined under
a Class placed at a superior level. In this structure, there is a core set of
information that is common to all Sub-Classes. This information, both in
terms of variables and methods, is included in the Class at the superior level and is
inherited by all Sub-Classes. This organizational approach requires a higher level of
abstraction in the sense that we have to structure the system under analysis
independently of particular instances of variables, recognizing what is common to
several objects, leading to a Class definition, and what is different between them,
leading to Sub-Classes. Apart from this higher degree of abstraction, modularity is
also favoured because a change in a private method or function at a certain level of
the hierarchy does not affect the remaining Classes and Sub-Classes.
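The Class, Sub-Class and instance concepts can be illustrated with a few lines of Python. The class names `Branch` and `Cable`, their variables and the loss formula are hypothetical and are not taken from the PREORD package, whose modules are written in Java or C++; the sketch only shows a core inherited from a superior Class and instances holding particular values.

```python
class Branch:
    """Class at the superior level: variables and methods common to all Sub-Classes."""

    def __init__(self, name, resistance_ohm):
        self.name = name
        self.resistance_ohm = resistance_ohm

    def losses_w(self, current_a):
        # Calculation method manipulating the values held by the instance.
        return self.resistance_ohm * current_a ** 2


class Cable(Branch):
    """Sub-Class: inherits the core and adds what is specific to cables."""

    def __init__(self, name, resistance_ohm, insulation):
        super().__init__(name, resistance_ohm)
        self.insulation = insulation


# Two instances with particular values assigned to their variables.
line = Branch("L1", resistance_ohm=0.8)
cable = Cable("C1", resistance_ohm=0.5, insulation="XLPE")
print(cable.losses_w(current_a=10.0))   # method inherited from Branch
```

Changing the body of `losses_w` in `Branch` would not require any change in `Cable`, which is the modularity argument made above.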

Regarding  the  DMS  under  development,  we  will  describe  in  the  next  section  its 
general  architecture,  the  referred  hierarchical  structure  as  well  as  the  Class  Models  for 
some  particular  objects  corresponding  to  network  devices  and  application  functions. 

3.  The  PREORD  Platform 

The  software  package  has  a  modular  multitask  client-server  architecture  and  is 
supported  by  a  commercial  Object-Oriented  Database  Management  System  - 
ObjectStore.  The  modules  corresponding  to  specific  algorithms  and  applications  of 
the  DMS  are  connected  to  this  platform.  The  clients  correspond  to  Java  applets  and 
the  modules  related  to  DMS  applications  or  algorithms  are  developed  in  Java  or  C++ 
and  are  registered  in  the  Java  DMS  server  as  services.  The  server  is  a  Java  application 
that  uses  the  Java  language  facilities  to  handle  concurrency  in  a  transparent  manner. 
The  reliance  on  Java  gives  us  platform  independence.  When  a  user  opens  an  HTML 
page  in  the  web  server  -  the  page  contains  a  reference  to  the  Java  applet  -  the  applet 
is  downloaded  and  starts  running  in  the  machine  of  the  client. 

In our software package we use an Object Oriented approach as a way to build a
mathematical model of the physical system to be analysed. The structural unit of this
model is the object, and objects represent concepts existing in the real world. Each
structural unit has a static identity and a dynamic behaviour associated with the transformations
that can affect its state variables. The rules governing the interactions between different
units of the system are also defined. The objects are grouped in classes so that it is
possible to study the interdependencies inside the software in order to minimize them.
We used a CASE tool to support the development process and generate the UML -
Unified Modelling Language - diagrams (see reference 9). The source code is under
version control, and through the use of the CASE tool it is possible to automatically
generate code from the detailed UML class diagrams.

In this development phase, special attention is devoted to the object models of the
components of power system networks such as lines, cables, transformers and
generators. These models integrate data corresponding to static and dynamic
characteristics. Dynamic data are, for instance, voltage magnitudes and phases, branch
currents and generations. The object model of each component also integrates
information regarding calculations that can be performed with static and dynamic
information. Apart from that, the OO model of the system includes a number of
objects to coordinate the actions of those component objects and other calculation or
interface objects. From this point of view, each power application function, such as
the power flow application, is structured in terms of a main coordinator
object that has several sub-coordinator objects depending on it and that gives orders
to component objects and other auxiliary calculation or interface objects.

The Object Oriented model of the system is organized in three levels:

Electric  Level  -  at  this  level  the  system  has  information  about  power  system 
components.  At  the  higher  hierarchical  position,  there  is  the  Electric  Network 
that  integrates  information  regarding  Electric  Connections,  Electric  Equipment 
and  Ground.  The  Electric  Equipment  Object  is  organized  in  terms  of  a  number  of 
subclasses  corresponding  to  one-terminal,  two-terminal  and  three-terminal 
devices  as  detailed  in  Figure  1 .  The  Electric  Connection  corresponds  to  a  bus  but, 
in  order  to  give  more  flexibility  to  the  software,  we  also  included  the  sub-class 
OtherNetwork  to  represent  the  equivalent  circuit  of  networks  to  which  the 
network  under  analysis  is  connected.  Finally,  the  Ground  sub-class  includes 
information  regarding  the  connections  of  an  electric  network  to  the  ground; 
Topologic  Level  -  the  available  information  regarding  the  models  of  electric 
devices,  the  connectivity  and  status  of  switching  devices  has  to  be  analysed  in 
order  to  produce  a  mathematical  model  of  the  system.  This  is  achieved  by 
Topology  Processor  application  leading  to  a  Class  Diagram  organized  in  terms  of 
islands  -  energized  or  non-energized  -  corresponding  to  sets  of  electrically 
connected  components  (nodes  and  branches); 

Geographic  Level  -  at  this  level  the  software  will  be  interconnected  with  a 
Geographic  Information  System  -  GIS  -  given  its  ability  to  represent  in  different 
layers  large  amounts  of  data  having  a  geographic  dispersed  nature; 

According  to  Figure  1,  the  information  regarding  the  components  of  power  systems  is 
structured  in  terms  of  one,  two  and  three  terminal  devices.  For  illustration  purposes, 
let  us  consider  the  Transformers.  The  information  both  for  two  and  three  winding 
transformers  is  structured  in  terms  of  the  Sub-Classes  Winding  and  Regulation.  The 
Sub-Class  Winding  includes  variables  and  methods  designed  to  represent  one  winding 
of  a  transformer  while  Regulation  includes  information  related  to  the  voltage 
regulation  abilities  of  a  transformer.  Finally,  the  Class  2Winding  Transformer  gathers 
all  this  information  in  order  to  model  such  a  device  (Figure  2). 


Fig.  1  -  Class  diagram  for  Electric  Equipment. 
[Class diagram: 2WindingTransformer]

Attributes:
numberObjectsClass: int = 0
id: int
yShunt: Complex

Operations:
getId(): int
getYShunt(): Complex
getYShuntPrimary(): Complex
getYShuntSecondary(): Complex
getZSeries(): Complex
getPrimaryWinding(): Winding
getSecondaryWinding(): Winding
newId(): void
setYShunt(Complex newYShunt): void
computeYShuntPrimary(): Complex
computeYShuntSecondary(): Complex
computeZSeries(): Complex
setPrimaryWinding(Winding)
setSecondaryWinding(Winding)
2WindingTransformer(int newTerminal1, newTerminal2, Complex newYShunt)


Fig.  2  -  Class  diagram  for  2  Winding  Transformer. 

The  client  interacts  with  the  server  in  order  to  request  services  by  activating  some 
objects  organized  in  terms  of  calculation  or  coordinator  objects.  As  examples,  we 
present  in  the  following  paragraphs  the  main  structure  corresponding  to  the  Topology 
Processor  and  Single  Phase  Power  Flow. 

The  Topology  Processor  builds  a  simplified  connectivity  model  of  the  system  taking 
into  account  the  position  of  switching  devices.  The  Class  TopologyProcessor  includes 
two  subclasses  leading  to  the  single  phase  equivalent  and  to  the  positive  sequence 
circuits.  The  single  phase  circuit  is  used  in  the  single-phase  power  flow  and  state 
estimation  and  the  positive  sequence  circuit  is  used  in  the  three  phase  symmetric  short 
circuit  analysis.  As  an  example,  SinglePhaseTP  directs  the  following  objects: 

Buses - simplifies the network by identifying individual nodes that are connected
by closed switching devices and joining them;

CreateIsland - checks, step by step, the connectivity of all nodes remaining in the
system in order to identify islands and to create data structures for them;

ClassifyBuses - classifies the buses in the system as PV and PQ and selects a PV
bus for reference in each island;

IslandClassification - this object classifies the pre-identified islands in terms of
being energized or not energized;

TracingFunctions - performs facilities such as Single Tracing, Multiple Tracing,
Tracing Upstream, Tracing Downstream and Tracing to Ground.
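
The first steps of the list above can be illustrated with a minimal Python sketch. It is not the package's implementation: the function names, the union-find approach and the rule that an island is energized when it contains a bus with a source are assumptions made only for illustration.

```python
from collections import defaultdict

def find(parent, x):
    """Return the representative of x's set, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def merge_buses(nodes, switches):
    """Join nodes linked by closed switches into buses (maps node -> bus id)."""
    parent = {n: n for n in nodes}
    for a, b, closed in switches:
        if closed:
            parent[find(parent, a)] = find(parent, b)
    return {n: find(parent, n) for n in nodes}

def create_islands(bus_of, branches):
    """Group buses into islands, i.e. sets of buses connected through branches."""
    parent = {b: b for b in set(bus_of.values())}
    for a, b in branches:
        parent[find(parent, bus_of[a])] = find(parent, bus_of[b])
    islands = defaultdict(set)
    for bus in parent:
        islands[find(parent, bus)].add(bus)
    return list(islands.values())

def classify_islands(islands, bus_of, sources):
    """Assumed rule: an island is energized if it contains a bus with a source."""
    source_buses = {bus_of[n] for n in sources}
    return [bool(island & source_buses) for island in islands]

# Hypothetical six-node network: one closed switch, one open switch, two branches.
nodes = [1, 2, 3, 4, 5, 6]
switches = [(1, 2, True), (3, 4, False)]
branches = [(2, 3), (5, 6)]
bus_of = merge_buses(nodes, switches)
islands = create_islands(bus_of, branches)
energized = classify_islands(islands, bus_of, sources=[1])
```

In this example nodes 1 and 2 collapse into one bus, the open switch leaves node 4 isolated, and only the island containing the source node is classified as energized.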

The Single Phase Power Flow is based on the Newton-Raphson method and it runs for
an island of the system identified by the Topology Processor. SinglePhasePF gets the
ids of the equipment in the selected island and directs calculation objects such as:

Initialization - initializes voltages and phases for all buses;

BuildInvJacobean - using sparsity techniques, builds and inverts the Jacobian
matrix at an iteration of the algorithm;

EvaluatePQMismatches - computes injected powers and evaluates mismatches
for active and reactive powers;

EvaluateVθIncrements - computes increments of voltages and phases using the
inverted Jacobian and the power mismatches and evaluates the convergence;

AdjustTaps - checks if voltage taps have to be adjusted for on-line voltage
regulation transformers;

BuildResults - computes the final values for voltages, phases, generated powers,
power and current flows and losses.

4.  Conclusions 

In this paper we describe the main guidelines of a DMS software package that adopts
a distributed client-server architecture. The DMS applications are organized in terms
of services that can be activated by clients when entering the DMS Web page. The
distribution network is modelled using Object Oriented concepts and is structured
in three levels - Electric, Topologic and Geographic. The power system applications
are implemented in terms of calculation or coordinator objects given that they can be
used to direct the activation of other objects and the flow of information. From the
experience gained so far we consider that the use of OO technology corresponds to a
major decision given the influence it has in all remaining development steps.
Currently, we are finishing the implementation of the coordinator and calculation
objects related to some power system applications and addressing issues related to
real time processing, considering that in a system such as this one dynamic information
is received from remote units periodically. At the end of this project we aim to have
a prototype of a DMS system ready for testing in environments closer to reality.
Abstract. This paper introduces the dynamic aggregation of pages in
Nautilus, a page-based DSM system whose main features are lock-based scope
consistency and multi-threading. Dynamic aggregation consists
of using a unit of granularity larger than a page in a page-based
DSM system. For the first time, an introductory evaluation of the influence
of the dynamic aggregation technique on the speedup of a DSM with
Nautilus's features is presented. The first results show that this technique can
improve Nautilus's speedup by up to 13.10%. The benchmarks evaluated
in this study are SOR (from Rice University) and LU (from SPLASH-2).


1  Introduction 

The evolution and decreasing cost of interconnection technologies and
PCs have made networks of workstations (NOWs) the most widely used parallel
computers. Big projects such as Beowulf[11] can be mentioned to exemplify this.

The Distributed Shared Memory (DSM) paradigm[8], which has been widely
discussed for the last 9 years, is an abstraction of shared memory which allows
a network of workstations to be viewed as a shared memory parallel computer.

In terms of granularity, DSMs have chosen in most cases page-grained approaches
instead of fine-grained ones. Also, the study of Iftode[17] showed that
for several applications from SPLASH-2, page-grain DSMs perform similarly
to or better than fine-grain ones, although generally higher bandwidth and message
handling costs favor page-based DSM while lower latency favors the fine-grained
approach[17].

Some important DSMs which belong to the second generation, like Quarks[7],
TreadMarks[3], CVM[10], Brazos[18] and Nautilus[5], are page-based DSM
systems. As noted in the previous paragraph, page-based solutions have
achieved good speedups for several benchmarks, but there is still room
for improvement.

In page-based DSM systems, shared memory accesses are detected using
virtual memory protection, thus one page is the unit of access detection and
can be used as a unit of transfer. Depending on the memory consistency model
and the situation, the diffs1 are also used as a unit of transfer. For example,
in homeless lazy release consistency (LRC) systems such as TreadMarks, if the node
has a dirty page, diffs are fetched from several nodes when an invalid page is accessed.
On the other hand, in JIAJIA, pages are fetched from the home nodes when a
remote page fault occurs.

* {mario,geraldo}@regulus.pcs.usp.br

1 diffs: codification of the modifications suffered by a page during a critical section

The unit of access detection and the unit of transfer can be increased by using
a multiple of the hardware page size. Such aggregation increases false
sharing, but it reduces the number of messages exchanged: if
a processor accesses several pages successively, a single page fault request and
reply can be enough, instead of the multiple exchanges that are usually required.
A secondary benefit is the reduction of the number of page faults. On the other
hand, false sharing can increase the amount of data exchanged and the number
of messages[16].

The main goal of this paper is to evaluate the page aggregation technique[16]
in the Nautilus DSM system. The technique is evaluated in Nautilus
on a network of PCs running a free operating system. The speedups of TreadMarks
made it the main DSM used by the scientific community as a reference
for optimal speedups. Here, TreadMarks speedups are used
only as an indication of good performance; another study[22] has already compared
TreadMarks with Nautilus, so comparing the two systems is not the main goal
of this paper.

The evaluation is done by applying different benchmarks: LU
(kernel from SPLASH-2)[15] and SOR (from Rice University). The environment
of the comparison is a network of 8 PCs interconnected by a shared Fast Ethernet
medium. The operating system used in each PC is Linux (2.x). This study is a
preliminary evaluation of this technique and only two aggregation sizes are used:
4kB (default) and 8kB.

2  Nautilus  DSM 

The main motivation of the new software DSM Nautilus is to develop a DSM with
a simple memory consistency model, in order to provide good speedups, and
with a simpler user interface that is fully compatible with TreadMarks
and JIAJIA.

Nautilus is a page-based DSM, like TreadMarks and JIAJIA. In this scheme,
pages are replicated across the nodes of the network, allowing multiple
reads and writes[8], thus improving speedups. By adopting the multiple writer
protocols proposed by Carter[2], false sharing is reduced and good speedups
can be achieved. The coherence mechanism adopted is write invalidation[8],
because several studies[2][3][4][12] show that this type of mechanism provides
better speedups for general applications. Nautilus, like JIAJIA, uses the scope
consistency model, which is implemented through a lock-based protocol[13].

The  implementation  of  the  lock-based  protocol  is  done  in  Unix  using  the 
mprotect()  primitive.  With  this  primitive,  pages  can  be  in  RO,  INV  or  RW 
states,  thus  pages  can  have  their  states  changed  easily. 

To summarize, Nautilus's features are: i) scope consistency; ii) multiple writer
protocols; iii) multi-threaded DSM: threads to minimize context switching; iv)
no use of SIGIO signals (which notify the arrival of a network message); v) minimization
of diffs creation; vi) primitives compatible with TreadMarks, Quarks
and JIAJIA. Nautilus follows the lock-based protocol proposed by JIAJIA[12],
because of its simplicity, thus minimizing the overheads. Based on this protocol,
the pages can be in one of three states: Invalid (INV), Read-Only (RO) and
Read-Write (RW). In addition, the home nodes of the pages always contain a
valid page, and the diffs corresponding to the remote cached copies of the pages
are sent to the home nodes. A list of the pages to be invalidated in the node
is attached to the acquire lock message.


3  Page  Aggregation 


In terms of implementation, following the direction of other DSMs, Nautilus
has a handler responsible for requesting a page from a remote node when a
segmentation fault occurs. When a page in the INV state is accessed, a
SIGSEGV signal is generated and the respective handler
requests the page from the home node. When the page arrives, the primitive
mprotect() changes the state from INV to RO.

When the page is written, another SIGSEGV signal is generated and the
primitive mprotect() changes the state of the page from RO to RW. After the
generation of the diffs, pages return to the RO state, again through the mprotect()
primitive. When write-notices arrive, indicating that pages were modified by other
nodes, those pages go to the INV state (again using the mprotect() primitive).

The primitive mprotect() can operate on a region whose size is a multiple of a page,
giving the same permission to the whole region. This makes it
possible to change the state of more than one page at the same time, which is
called the page aggregation technique.
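
The state transitions described above, and the effect of protecting several pages as one unit, can be mimicked with a small Python model. This is only a toy simulation, not Nautilus code: the class `PageTable`, its counters and the access pattern are hypothetical, no real mprotect() call is made, and false sharing between nodes is not modelled.

```python
INV, RO, RW = "INV", "RO", "RW"

class PageTable:
    """Toy protection table where one unit covers `pages_per_unit` hardware pages."""

    def __init__(self, n_pages, pages_per_unit=1):
        self.pages_per_unit = pages_per_unit
        n_units = -(-n_pages // pages_per_unit)  # ceiling division
        self.state = [INV] * n_units
        self.faults = 0      # simulated SIGSEGV signals
        self.get_pages = 0   # simulated remote page requests

    def read(self, page):
        unit = page // self.pages_per_unit
        if self.state[unit] == INV:
            # Fault on an invalid unit: fetch it and mark it read-only.
            self.faults += 1
            self.get_pages += 1
            self.state[unit] = RO

    def write(self, page):
        self.read(page)  # an invalid unit must be fetched first
        unit = page // self.pages_per_unit
        if self.state[unit] == RO:
            # Write to a read-only unit: second fault, unit becomes read-write.
            self.faults += 1
            self.state[unit] = RW

def run(pages_per_unit, n_pages=16):
    """Read every page once, then write every fourth page."""
    table = PageTable(n_pages, pages_per_unit)
    for page in range(n_pages):
        table.read(page)
    for page in range(0, n_pages, 4):
        table.write(page)
    return table.faults, table.get_pages

faults_4k, requests_4k = run(pages_per_unit=1)   # one 4kB page per unit
faults_8k, requests_8k = run(pages_per_unit=2)   # two 4kB pages per 8kB unit
```

For this sequential access pattern the model produces fewer simulated faults and page requests with the larger unit, which is the effect the technique aims at; it says nothing about the cost of the extra false sharing.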

The study [16] reports that aggregation increases false sharing and
reduces the number of messages exchanged. Also, if a processor accesses
several pages successively, a single page fault request and reply can be enough,
instead of the multiple exchanges of requests and replies that are usually required.
The study [16] also shows that there is a reduction in the number of page faults,
but false sharing can increase the amount of data exchanged and the number of
messages.

This study is an original contribution because the study [16] was carried out with
TreadMarks, which is a homeless lazy release consistency system, and the technique
had not previously been applied to a scope consistency, multi-threaded
DSM for Unix, which are Nautilus's features.

By changing the default page size (4kB) to, for example, 8kB using the mprotect()
primitive in Nautilus, it is possible to evaluate the effect of the increased size
on page fault reduction and on the speedups.
4  Experimental  Platform  and  Result  Analysis 

The results reported here were collected on a network of 8 PCs. Each node (PC) is
equipped with an AMD K6 233 MHz processor, 64 MB of memory and a Fast
Ethernet card (100 Mbits/s). The nodes are interconnected with a hub. In order
to measure the speedups, the network was completely isolated from any
other external networks. Each PC runs Linux Red Hat 6.0. The experiments were
executed with no other user process.

In this study, two sizes are considered for the page size: 4kB, which is the default
(memory hardware), and 8kB, which is a multiple of 4kB.

The test suite includes two programs: LU (from SPLASH-2[15]) and SOR
(from Rice University). The data input size N used in the LU evaluation is
N=1024. The size of the red and black matrices used in the SOR evaluation
is 1728x1728; the number of iterations for the SOR benchmark is 10.

Before  presenting  the  results  and  their  analysis,  it  is  necessary  to  emphasize 
that  the  execution  time  for  number  of  nodes  =  1  in  all  evaluated  benchmarks 
is  obtained  from  the  sequential  version  of  the  benchmarks  without  any  DSM 
primitive.  So,  the  primitive  used  to  allocate  memory  to  obtain  the  sequential 
time  (number  of  nodes  =  1)  is  malloc(),  default  primitive  of  C  programming. 

In order to have an accurate, homogeneous and fair comparison, the same
programs are executed using TreadMarks (version 1.0.3). There are some constraints
with the TreadMarks version (1.0.3) used: i) the applications were
executed and the speedups measured using Nautilus running on up to 8 nodes;
ii) bigger input sizes: the shared memory size is limited in this version; iii) only
time and speedups can be obtained from this version, thus it was not possible
to obtain the number of page faults and SIGSEGV signals.

Table 1 shows some features and results of the benchmarks: sequential time
(t(1)), 8-processor parallel run time (t(8)), speedup (Sp), remote get page request
counts (gp) and number of local SIGSEGVs of Nautilus (SG). The sequential time
t(1) was obtained from the sequential program, which uses no DSM primitives and
allocates memory with malloc(). In order to evaluate the adaptive write detection
speedup, remote get page request counts and the number of local SIGSEGVs of Nautilus
are taken. In Table 1, Tmk means TreadMarks, N4k means Nautilus using a 4kB
page size and N8k means Nautilus using an 8kB page size.

For both benchmarks evaluated in this study, a big reduction of SIGSEGV
signals can be observed in Table 1, by looking at the SG rows. The table also
shows a reduction in the number of page fault requests. These
two results were expected because, as the page size increases, more data
is included inside a page and, as an immediate consequence, fewer page
faults and requests for pages are necessary.

For LU, a speedup reduction of 2.2% is observed when the dynamic aggregation technique
is applied. Although the number of SIGSEGVs and the number of get page
requests decrease by 36.98% and 19.37% respectively, as can be observed in
Table 1, the use of the dynamic aggregation technique changes the data distribution.
This new data distribution changes the home nodes, giving a distribution
less adequate than the initial one (4kB) and decreasing the speedups of Nautilus.
app        LU        SOR
t(1)       350.90    29.10
t(8).Tmk   54.45     8.66
t(8).N4k   54.32     7.66
t(8).N8k   55.52     6.54
Sp.Tmk     6.44      3.36
Sp.N4k     6.46      3.80
Sp.N8k     6.32      4.45
SG.N4k     7980      12425
SG.N8k     5029      7912
gp.N4k     1528      118
gp.N8k     1232      72

Table 1. Comparison of N4k and N8k


For SOR, which has a good data distribution, the dynamic aggregation technique
decreased the number of SIGSEGVs by 36.00%, and also the number of
pages requested by 38.00%. These reductions explain the speedup
increase of 13.1%.
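
As a worked check, the derived quantities can be recomputed from the rounded entries of Table 1. The short Python sketch below does this; the dictionary layout and function names are illustrative only. The LU figures quoted in the text are reproduced, while the SOR reductions come out close to, though not identical with, the quoted ones, which may have been derived from unrounded measurements.

```python
# Values copied from Table 1 (N4k = 4kB pages, N8k = 8kB pages).
table1 = {
    "LU":  {"t1": 350.90, "t8_N4k": 54.32, "t8_N8k": 55.52,
            "SG_N4k": 7980, "SG_N8k": 5029, "gp_N4k": 1528, "gp_N8k": 1232},
    "SOR": {"t1": 29.10, "t8_N4k": 7.66, "t8_N8k": 6.54,
            "SG_N4k": 12425, "SG_N8k": 7912, "gp_N4k": 118, "gp_N8k": 72},
}

def speedup(t_sequential, t_parallel):
    """Speedup is the sequential time divided by the 8-node parallel time."""
    return t_sequential / t_parallel

def reduction_pct(before, after):
    """Relative reduction, in percent, when going from `before` to `after`."""
    return 100.0 * (before - after) / before

summary = {}
for app, row in table1.items():
    summary[app] = {
        "Sp_N4k": round(speedup(row["t1"], row["t8_N4k"]), 2),
        "Sp_N8k": round(speedup(row["t1"], row["t8_N8k"]), 2),
        "SG_reduction": round(reduction_pct(row["SG_N4k"], row["SG_N8k"]), 2),
        "gp_reduction": round(reduction_pct(row["gp_N4k"], row["gp_N8k"]), 2),
    }

for app, values in summary.items():
    print(app, values)
```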

The goal of this paper is not to compare Nautilus with TreadMarks, as
was done in the study of Marino[22]. For Matmul, Nautilus outperforms TreadMarks
by 18.6%; for SOR, Nautilus(4k) outperforms TreadMarks by 13.09% and
Nautilus(8k) outperforms TreadMarks by 32.44%.


5  Conclusion 

In this paper the page aggregation technique for a DSM with features similar to
those of Nautilus was presented. As a reference for optimal speedups, TreadMarks
was employed in order to have a fair comparison.

The page aggregation technique improved Nautilus
speedups by up to 13.10% for the SOR benchmark, reducing the number of page
faults and the number of SIGSEGVs. For LU, the dynamic aggregation technique
decreased the speedup, possibly due to the change of the home nodes.

In  addition,  the  speedup  of  Nautilus  was  compared  to  TreadMarks,  but  not 
as  the  main  goal  of  the  paper. 

In future work, other applications will be tested and other page sizes,
for example 16kB and 32kB, will also be evaluated.
Abstract. A parallel water quality modelling algorithm is presented for tracking
dissolved substances in water-distribution networks. The algorithm, based on a
parallel version of the Discrete Volume Element Method, contains an initial
stage in which the water network is divided into several parts by means of the
Multilevel Recursive Bisection graph partitioning method. The algorithm has
been implemented and tested on a cluster of PCs with the MPI system,
achieving good performance as shown in the results included.


1.  Introduction 

Computer  simulation  of  water  networks  by  means  of  mathematical  models  is 
nowadays  common  practice  in  most  water  companies,  being  an  indispensable  tool  for 
various  purposes.  In  particular,  computer  simulation  is  used,  among  other  objectives, 
to  guarantee  the  supply  of  the  required  water  flows  with  the  adequate  pressures, 
ensure  the  existence  of  water  stores  in  case  of  necessity,  comply  with  water  quality 
requirements, reduce energy costs for the network operation, or reduce leakage.

The computational tasks related to the analysis of water networks are getting
increasingly complex, due to various factors. First, the size and level of detail of the
network models is growing, as a consequence of the incorporation of data from GIS
(Geographical Information Systems). Second, it is nowadays increasingly frequent to
be concerned with complex optimization problems. In this context, the need for
more powerful computing resources has become evident, and hence the interest in the
use of parallel computing.

1 Partly funded by the European Commission through the PST activity HIPERWATER
(ESPRIT project 24003), and by the Spanish Government through the project CICYT
TIC96-1062-C03-01.

Consequently, the objective of the HIPERWATER project (http://hiperttn.upv.es/hiperwater)
was to introduce High Performance Computing in the simulation and
optimization of water networks, using the power of computing clusters to speed up
those tasks. The project resulted in the development of a software demonstrator, based
on EPANET, a well known water network simulation package [5]. HIPERWATER
tackles three different problems making use of HPCN solutions [3], [4]:

•  Hydraulic  simulation.  The  problem  consists  in  obtaining  the  value  of  flows  and 
pressures  in  the  different  network  components.  The  equations  modelling  water 
networks  are  non-linear  and  therefore  require  an  iterative  solution. 

•  Water  quality  simulation.  By  solving  this  problem  information  is  obtained  about 
substance  concentrations,  water  age  analysis,  or  percentage  of  flow  from  a 
determined  source. 

•  Leakage  minimization.  The  objective  is  to  minimise  leakage  by  controlling 
pressures  with  a  number  of  Pressure  Reducing  Valves  (PRV).  This  is  done  by 
means  of  a  Sequential  Quadratic  Programming  algorithm. 

This  paper  is  devoted  to  the  second  problem  presented  above.  There  are  various 
methods  that  can  be  used  for  water  quality  simulation.  In  particular,  we  will  consider 
here  the  Discrete  Volume  Element  Method  (DVEM)  [6],  which  will  be  described  next. 


2.  The  Discrete  Volume  Element  Method 


A  water  distribution  network  is  viewed  as  a  collection  of  links  connected  together  at 
their  endpoints  called  nodes.  Links  can  be  of  different  types:  pipes,  pumps  or  valves. 
The  purpose  of  the  water  quality  simulation  is  to  track  the  fate  of  a  dissolved 
substance  flowing  through  the  network  over  time.  The  magnitude  and  direction  of 
water  flow  throughout  the  network  over  time  is  taken  as  input  data,  being  the  result  of 
the hydraulic simulation problem. In particular, we consider the type of hydraulic
simulation known as extended period simulation, which divides the simulation period
into a sequence of time steps, in each of which the flows and velocities in the links
are assumed to be constant.

The  DVEM  is  formulated  assuming  a  one-dimensional  transport  model.  Within 
each  hydraulic  time  step,  a  shorter  water  quality  time  step  is  computed,  and  each  pipe 
is  divided  into  a  number  of  volume  segments  (elements).  Then,  advance  and  reaction 
of  the  substance  is  simulated  through  the  following  phases  (see  Fig.  1): 

•  Reaction.  The  reaction  of  the  substance  to  be  measured  is  simulated  in  this  phase, 
if  the  substance  is  reactive. 

•  Transport into nodes. The mass of substance and volume of water of the last
segment of each pipe is accumulated into its connecting node. Then, the new
concentration of the substance on each node is computed.

•  Transport along links. Mass is shifted from volume element k to k+1 of each link.

•  Transport  out  of  nodes.  Mass  is  moved  out  of  each  node  into  the  first  volume 
element  of  all  outgoing  links. 

This sequence of phases is repeated until the start of the next hydraulic time step.
Then the water quality time step is recomputed, the links are resegmented, and
computation continues. The method is fully explicit, in the sense that it does not
require the solution of equation systems.


[Figure: successive states of a link and its nodes: Original Mass, Reaction,
Transport Into Node, Transport Along Link, Transport Out of Node]

Fig. 1. DVEM phases for a link and its connecting nodes.
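
The four phases can be sketched for a single pipe as follows. This is a minimal illustration rather than the EPANET or HIPERWATER code: the function name `dvem_step`, the equal-volume segments, the first-order decay used for the reaction phase and the fixed upstream concentration are all assumptions introduced here.

```python
import math

def dvem_step(masses, seg_volume, c_upstream, decay_rate, tau):
    """Advance one water quality time step on one pipe.

    masses      -- mass of substance in each volume element, upstream to downstream
    seg_volume  -- volume of one element (assumed equal for all elements)
    c_upstream  -- concentration at the upstream node (assumed constant)
    decay_rate  -- first-order decay coefficient (assumed reaction model)
    Returns the new element masses and the downstream node concentration.
    """
    # Reaction: each element decays independently.
    reacted = [m * math.exp(-decay_rate * tau) for m in masses]
    # Transport into nodes: the last element empties into the downstream node.
    node_mass, node_volume = reacted[-1], seg_volume
    c_downstream = node_mass / node_volume
    # Transport along links: mass moves from element k to element k+1.
    shifted = [0.0] + reacted[:-1]
    # Transport out of nodes: the upstream node fills the first element.
    shifted[0] = c_upstream * seg_volume
    return shifted, c_downstream

# A conservative substance (no decay) entering an initially clean 3-element pipe.
masses = [0.0, 0.0, 0.0]
history = []
for _ in range(4):
    masses, c_down = dvem_step(masses, seg_volume=2.0, c_upstream=1.0,
                               decay_rate=0.0, tau=1.0)
    history.append(c_down)
```

With three elements, the front needs three steps to fill the pipe and reaches the downstream node in the fourth step, which illustrates the explicit, shift-based nature of the method.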


The water quality time steps used in the method are chosen to be as large as
possible without causing the volume element size of any pipe to be larger than the
volume of the pipe itself. Taking into account that the volume element size of a pipe
is given by $Q_i \tau$, where $Q_i$ is the flow in pipe i and $\tau$ is the water quality
time step, $\tau$ must be chosen as

\tau = \min_i \frac{V_i}{Q_i}    (1)

where $V_i$ is the volume of pipe i. The quotient $V_i / Q_i$ is referred to as the travel
time of pipe i. Then, the number of volume segments in each pipe is

n_i = \left\lfloor \frac{V_i}{Q_i \tau} \right\rfloor

where $\lfloor x \rfloor$ represents the largest integer less than or equal to x.
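
A short numerical illustration of these two expressions is given below. The function names and the sample pipe data are hypothetical; taking the absolute value of the flow and skipping pipes with zero flow are assumptions added here so that the sketch does not divide by zero.

```python
import math

def quality_time_step(volumes, flows):
    """Largest step for which no volume element exceeds its pipe: min of V_i/|Q_i|."""
    return min(v / abs(q) for v, q in zip(volumes, flows) if q != 0)

def segment_counts(volumes, flows, tau):
    """Number of volume elements per pipe: floor(V_i / (|Q_i| * tau)).

    Pipes with zero flow are given a single element (an assumption of this sketch).
    """
    return [math.floor(v / (abs(q) * tau)) if q != 0 else 1
            for v, q in zip(volumes, flows)]

volumes = [10.0, 6.0, 9.0]   # pipe volumes V_i
flows = [2.0, 3.0, 1.5]      # pipe flows Q_i
tau = quality_time_step(volumes, flows)       # travel times are 5, 2 and 6
segments = segment_counts(volumes, flows, tau)
```

The pipe with the shortest travel time fixes the step and always receives exactly one element, while the other pipes are split into as many whole elements as fit.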


3.  Parallel  DVEM  algorithm 


Different water quality time steps must be performed sequentially in time, due to the
fact that the solution of a step requires the results of the previous one. Thus, the
parallelization of the water quality process must be based on a parallel algorithm for
each individual step.

In order to do so, we first divide the water network into several parts, one for each
processor in our system. This initial network partitioning plays an important role in
minimising communications and balancing the computational load. The partitioning
algorithm has two objectives: a similar number of
elements (nodes) should be assigned to each processor, and the number of pipes with
nodes belonging to different processors should be minimal. The network can be
considered as a graph where the vertices are given by the nodes and the edges of the
graph are the pipes and valves of the network.

In  particular,  the  approach  used  is  known  as  Multilevel  Recursive  Bisection 
technique  [1],  [2].  Since  the  partition  of  the  network  is  carried  out  only  once  and  it  is 
not  a  time-consuming  task  (in  the  test  networks  the  time  involved  is  less  than  a 
quarter  of  a  second),  a  serial  version  of  this  algorithm  has  been  applied. 

This  algorithm  works  in  the  following  way.  First,  a  coarsening  phase  is  performed, 
where  the  size  of  the  graph  to  be  partitioned  is  reduced,  by  collapsing  vertices  and 
edges.  This  reduction  is  repeated  until  a  graph  with  a  few  hundred  vertices  is 
obtained.  Then,  in  the  partitioning  phase  a  bisection  of  the  small  graph  is  carried  out, 
and  two  subgraphs  are  obtained,  with  a  minimum  number  of  edges  interconnecting 
them,  and  a  similar  amount  of  vertices  in  each  subgraph.  Finally,  the  uncoarsening 
phase  takes  place,  where  the  objective  is  to  project  back  the  partition  to  the  original 
graph,  by  means  of  a  successive  refining  process. 

This  complete  process  leads  to  a  good  partition  for  the  graph  in  a  fast  way.  It  must 
be  noted  that  the  graph  partitioning  determines  how  the  nodes  are  assigned  to  each 
processor, but nothing is said about the distribution of the pipes. As one would
expect, a pipe will belong to the processor owning its end nodes. If the two end
nodes belong to different processors, the pipe will be arbitrarily assigned to either of
them.  Actually,  this  means  that  a  frontier  between  network  parts  crosses  nodes  and 
not  pipes,  although  the  associated  frontier  in  the  graph  crosses  edges  and  not  vertices. 
Whenever  a  graph  frontier  crosses  an  edge,  the  network  frontier  is  moved  to  one  of 
the  two  end  nodes  of  the  corresponding  pipe.  We  refer  to  the  nodes  situated  in  a 
network  frontier  as  shared  nodes. 
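
The pipe distribution rule can be sketched in Python as follows. This is only an illustration: the function name, the data layout and the tie-breaking choice (a cut pipe goes to the owner of its first end node) are assumptions, since the text only says that the choice is arbitrary.

```python
def distribute_pipes(node_owner, pipes):
    """Assign each pipe to a processor and collect the shared nodes.

    node_owner: dict mapping each node to its owning processor.
    pipes: list of (node_u, node_v) pairs.
    Returns (pipe_owner, shared_nodes).
    """
    pipe_owner = {}
    shared_nodes = set()
    for u, v in pipes:
        # Assumed tie-break: the pipe follows the owner of its first end node.
        pipe_owner[(u, v)] = node_owner[u]
        if node_owner[u] != node_owner[v]:
            # The frontier moves onto the other end node, which is then
            # seen by both processors: it becomes a shared node.
            shared_nodes.add(v)
    return pipe_owner, shared_nodes


# Hypothetical chain 1 - 2 - 3 with node 3 on a second processor.
owners = {1: 0, 2: 0, 3: 1}
pipe_owner, shared = distribute_pipes(owners, [(1, 2), (2, 3)])
```

In this example both pipes end up on processor 0 and node 3 is shared between the two processors.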

With  the  water  network  distributed  among  the  processors,  the  parallel  algorithm  for 
the  basic  quality  time  step  is  largely  given  by  the  sequential  one  applied  in  each 
processor  to  the  corresponding  local  portion  of  the  network.  Of  course,  some  extra 
communication  operations  will  have  to  be  carried  out,  since  the  different  network 
portions  are  not  independent  of  each  other.  In  particular,  in  order  to  perform  the  phase 
of  “transport  into  nodes”  for  shared  nodes,  each  processor  has  a  local  instance  of  these 
nodes  into  which  the  transport  is  done,  obtaining  the  local  values  of  mass  and  volume. 
After  this  phase,  a  communication  operation  is  required  in  which  the  local 
contributions  of  the  shared  nodes  are  combined  to  obtain  the  final  mass  and  water 
volumes,  values  which  are  then  sent  back  to  the  processors  sharing  the  nodes  (this 
communication  operation  is  implemented  by  means  of  the  MPI  function 
MPI_Allreduce).  The  rest  of  the  phases  in  the  sequential  DVEM  algorithm  are  not 
altered. 

The water quality time step, in turn, is computed by
obtaining locally the minimum travel time for each network portion, and then taking
the minimum of these values (this again involves an MPI_Allreduce operation).
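
The two reductions described above can be imitated in plain Python without an MPI installation. The sketch below only simulates the semantics of an all-reduce (every process ends up with the same combined value); the function names are illustrative assumptions and no real MPI call is made.

```python
import operator


def allreduce(local_values, op):
    """Simulate an all-reduce: combine one value per process with `op`
    and hand the same result back to every process."""
    result = local_values[0]
    for value in local_values[1:]:
        result = op(result, value)
    return [result] * len(local_values)


def combine_shared_node(local_mass, local_volume):
    """Sum the local contributions of one shared node over all processes."""
    return allreduce(local_mass, operator.add), allreduce(local_volume, operator.add)


def global_time_step(local_min_travel_times):
    """The quality time step is the minimum of the local minimum travel times."""
    return allreduce(local_min_travel_times, min)


# Hypothetical values for three processes.
mass, volume = combine_shared_node([1.0, 2.0, 0.5], [10.0, 20.0, 5.0])
steps = global_time_step([3.0, 1.5, 2.0])
```

In an actual MPI program each process would hold only its own entry of these lists, and the reduction would be carried out by the library.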


4.  Results 


The parallel algorithm for the water quality simulation has been tested on a
platform consisting of several 200 MHz Pentium Pro PCs running Windows,
connected via a Fast Ethernet network. Two water networks, named Test A and Test
B, have been used for the testing. Their main characteristics are given in Table 1.


Table 1. Characteristics of the test networks.

          Pipes   Nodes   Tanks   Substance   Simulation duration
Test A     4901    2501       1   Chlorine    48.00 hrs
Test B    19801   10001       1   Chlorine    24.00 hrs

Execution times obtained with these test networks are shown in Fig. 2, which also
includes the execution times of the original sequential EPANET 1.1e simulation
program.

The  resulting  speed-up  is  shown  in  Fig.  3.  Here,  the  speed-up  values  are  taken  with 
respect  to  the  sequential  simulation  program  EPANET,  in  order  to  get  the  real  gain  in 
execution  time  that  has  been  achieved.  A  speed-up  of  up  to  3.1  has  been  obtained, 
which  illustrates  the  good  performance  achieved  with  the  parallelization. 


[Figure: execution-time plots for the two test networks.]

Fig. 2. Parallel algorithm execution times, in seconds.

[Figure: speed-up plots; panels Test A and Test B.]

Fig. 3. Parallel algorithm speed-up.

Finally, the efficiency obtained is shown in Table 2. In this case the efficiency is
computed with respect to the parallel algorithm executed on a single processor.
Test B presents lower efficiencies than Test A on three and four processors, although Test B
corresponds to a larger network. This is due to the time spent on reading and
distributing hydraulic results, and on collecting and writing the final results,
which must be done at the beginning and the end of each hydraulic time step.


Table 2. Parallel algorithm efficiency.

Efficiency   2 proc.   3 proc.   4 proc.
Test A          0.82      0.77      0.68
Test B          0.83      0.64      0.54
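
To make the relation between the reported quantities explicit, the sketch below uses the usual definitions of speed-up (reference time divided by parallel time) and efficiency (speed-up divided by the number of processors), which are assumed here to be the ones used. It converts the efficiencies of Table 2 into the speed-up relative to the parallel code on one processor. This differs from Fig. 3, where the reference is the sequential EPANET program.

```python
def speedup(t_reference, t_parallel):
    """Speed-up of a run taking t_parallel against a reference time."""
    return t_reference / t_parallel


def efficiency(t_single, t_parallel, processors):
    """Efficiency relative to the same code running on one processor."""
    return speedup(t_single, t_parallel) / processors


def speedup_from_efficiency(eff, processors):
    """Invert the efficiency definition: speed-up = efficiency * processors."""
    return eff * processors


# Efficiencies as listed in Table 2.
table2 = {
    "Test A": {2: 0.82, 3: 0.77, 4: 0.68},
    "Test B": {2: 0.83, 3: 0.64, 4: 0.54},
}
relative_speedup = {
    test: {p: speedup_from_efficiency(e, p) for p, e in row.items()}
    for test, row in table2.items()
}
```

For instance, an efficiency of 0.68 on four processors corresponds to a speed-up of about 2.72 over the single-processor parallel run.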

5.  Conclusions 

A  parallel  algorithm  for  the  quality  simulation  of  drinking  water  networks,  based  on 
the  DVEM  method  implemented  in  the  EPANET  package,  has  been  presented.  The 
proposed  method  allows  for  efficient  simulation  of  the  spatial  and  temporal 
distribution  of  substances  in  water  networks. 

The algorithm has been developed within the framework of the HIPERWATER project. The
objective of HIPERWATER has been to meet the need for computational power by
introducing High Performance Computing techniques. The project considered the
problems of hydraulic simulation and leakage minimization, as well as
water-quality simulation.

Concerning  the  water-quality  algorithm  presented  here,  results  obtained  show  an 
important  reduction  in  the  computation  time  with  respect  to  the  EPANET  package. 
The  paper  shows  that  High  Performance  Computing  is  a  valuable  tool  for  the 
reduction  of  time  spent  on  quality  simulations  for  large  drinking  water  networks. 
Abstract. An exhaustive library of sparse iterative methods and
preconditioners in HPF was developed, and a tool to predict and visualize
the performance of these codes is presented. This tool can be used both
by the users and by the library's developers to optimise the efficiency of
the codes, as well as to simplify their use. The information offered by
this tool combines theoretical features of the methods and preconditioners
with some practical considerations and predictions about
performance aspects of their execution.


1  Introduction 

The complexity of parallel systems makes a priori performance prediction difficult.
In fact, performance instrumentation and visualization in parallel systems
was found to be a complex multidimensional problem [9]. A performance data
collection, analysis and visualization environment is needed to detect the effects
of architectural and system software variations.

The reasons for poor performance of codes on distributed memory systems
can be varied, and users need to be able to understand and correct performance
problems. This is especially relevant when high level libraries and programming
languages are used to implement parallel codes, as in the case of HPF
[7].

Most performance tools, both research and commercial, focus on low
level message-passing platforms like MPI or PVM [4] [5] [1], and the most prevalent
approach taken by these tools is to collect performance data during program
execution and then provide post-mortem display and analysis of performance
information [10] [11]. Our proposal is different: we present a tool that predicts
performance data of irregular HPF codes before executing them.

The efficient implementation of irregular codes in HPF is hard. However, several
techniques for handling this problem using intrinsic and library procedures
as well as data distribution directives can be applied. An exhaustive library of
iterative methods and preconditioners was developed [3]; the tool presented in
this paper analyses the performance of these codes. This tool can be used both by
the users of this library to optimize the efficiency and by the library's developers
to check for inefficiencies through an easy-to-use interface.

Several strategies were used to optimize the efficiency of these parallel codes.
In the literature, many iterative methods have been presented and developed and
it is impossible to cover them all. We chose the methods below, either because
they represent the current state of the art for solving large sparse linear systems
[2] or because they present special programming features. The methods we consider
are: Conjugate Gradient (CG), Biconjugate Gradient (BiCG), Biconjugate
Gradient Stabilized (BiCGSTAB), Conjugate Gradient Squared (CGS), Generalized
Minimal Residual (GMRES), Jacobi, Quasi-Minimal Residual (QMR)
and Gauss-Seidel Successive Over-Relaxation (SOR). Additionally, some
preconditioners were also implemented in HPF, and can be applied to the target
sparse matrix to transform it into one with a more favourable spectrum. These
preconditioners are: the Jacobi preconditioner, the Symmetric Successive
Over-Relaxation (SSOR), the Incomplete LU factorization (ILU(0)), the Incomplete
LU factorization with threshold (ILUT), the Neumann Polynomial preconditioner
and the Least Squares Polynomial preconditioner.
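
As a reminder of how one of these method and preconditioner pairs fits together, the following is a minimal dense-matrix sketch of the Conjugate Gradient method with the Jacobi (diagonal) preconditioner, written in Python. It is a textbook formulation given for illustration only, not the HPF code of the library, and it assumes a symmetric positive definite matrix stored as a list of rows.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b (A symmetric positive definite) by CG preconditioned
    with the inverse of the diagonal of A. Returns (x, iterations)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                  # residual b - A x for x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    z = [d * ri for d, ri in zip(inv_diag, r)]   # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for k in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small hypothetical system with exact solution (1/11, 7/11).
solution, iterations = jacobi_pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Each iteration costs one matrix-vector product, a few inner products and vector updates, which is the kind of per-iteration cost a prediction tool can count in advance.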

The system on which we implemented the parallel codes was the Fujitsu
AP3000, a distributed memory multiprocessor which consists of 12 UltraSparc
processors connected by a mesh network [8]. However, both the parallel codes
and the performance tool can be used directly on other parallel and distributed
platforms with minor changes, if any.


2  The  visualization  tool 

Some  knowledge  about  the  linear  system  is  needed  to  guarantee  convergence  of 
these  algorithms,  and  generally  the  more  that  is  known  the  more  the  algorithm 
can  be  tuned.  Thus,  we  have  chosen  to  present  an  algorithmic  outline,  with 
guidelines  for  choosing  a  method  as  part  of  our  tool. 

A method that works efficiently for one problem may not work so well
for another. The problem becomes more complex if the application of a
preconditioner is also considered. The tool presented in this report helps to
find the most effective method for the matrix at hand, avoiding the need for an
exhaustive search.

Our proposal combines theoretical features of the methods and preconditioners
with some practical considerations. In this way, relationships
between data become readily apparent when the data are graphically displayed.
The tool helps users to understand, and draw conclusions from, the iterative
methods and their implementation in HPF for each particular matrix.

The goals achieved by our prototype are:

-  The  tool  allows  users  to  select  interactively  the  data  to  be  displayed. 

-  The  tool  is  easy  to  install  and  its  use  is  fairly  self-explanatory. 

-  It  includes  tools  for  gathering  performance  information. 

-  The  individual  analysis  and  visualization  components  are  easy  to  build  for 
many  different  matrices  and  preconditioners. 
-  Because a standard platform, TCL/TK, was used to implement the tool, new
components or modifications can be added.

-  The tool is fast because the large amount of data required is filtered.

-  It provides great functionality as it uses an X Windows platform.

-  It supports multiple analysis levels, including the sparse matrix characteristics,
the methods, and the performance analysis and predictions.

-  It  is  portable  to  systems  including  a  TCL/TK  library,  providing  portability 
across  a  great  variety  of  computers. 

-  Although  the  target  platform  is  the  Fujitsu  AP3000,  the  tool  can  be  easily 
adapted  to  analyse  other  multiprocessors. 

The number of generated events is potentially enormous. The environment
includes a set of data filters that process the input data, reducing their number.
Via an environment control, the display can be changed dynamically, allowing
the user to select the display formats best suited to the data.

The diversity of the performance data demands an equally rich set of performance
displays. The displays include: dials, bar charts, LEDs, Kiviat diagrams,
matrix views, X-Y plots, 3-dimensional plots and text information.

The user interface for the prototype visualization system provides comprehensive
control. Through menus the user can obtain valuable information about
the execution of each iterative method and preconditioner, and so select the best
one in a friendly environment.

The  main  capabilities  of  this  visualization  tool  are: 

-  It loads the sparse matrix in a standard format [6] and determines its essential
characteristics, such as the pattern, the sparsity, the bandwidth, the
symmetry, etc.

-  It presents theoretical aspects of the application of the iterative methods and
preconditioners to the matrix.

-  It reports the number of floating point operations required for each iteration of the
methods and preconditioners, for both the sequential and the HPF codes.

-  The  load  balance  in  terms  of  the  computational  costs. 

-  The  number  of  communications  and  their  lengths.  This  information  can  be 
shown  for  each  processor. 

-  A  prediction  about  the  execution  time  for  each  iteration. 

-  As the number of iterations required by a method cannot be predicted, a
small number of iterations can be executed in order to analyse changes in
the residuals and obtain a first estimate of the convergence of each method
and preconditioner.

-  Detailed  statistical  information  about  a  routine  can  be  seen. 

-  The  use  of  pull-down  menus  to  select  visualization  displays,  or  to  change 
options  is  available. 

-  The  statistics  display  shows  the  cumulative  data  for  the  complete  parallel 
code  or  for  each  process. 

-  Finally,  the  method  and  the  preconditioner  can  be  actually  executed. 
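
The first capability in the list, determining the essential characteristics of the matrix, can be sketched in a few lines of Python. The storage layout (a dictionary from index pairs to nonzero values) and the function name are assumptions made for this example and do not reflect the tool's internal format.

```python
def matrix_characteristics(n, entries):
    """Basic structural properties of an n x n sparse matrix.

    entries: dict mapping (row, column) to a nonzero value, 0-based indices.
    """
    nnz = len(entries)
    # Fraction of zero entries.
    sparsity = 1.0 - nnz / (n * n)
    # Largest distance of a nonzero from the main diagonal.
    bandwidth = max((abs(i - j) for (i, j) in entries), default=0)
    # Symmetric if every nonzero has an equal mirror entry.
    symmetric = all(entries.get((j, i)) == v for (i, j), v in entries.items())
    return {"nnz": nnz, "sparsity": sparsity,
            "bandwidth": bandwidth, "symmetric": symmetric}


# Hypothetical 3 x 3 tridiagonal matrix.
tridiagonal = {(0, 0): 2.0, (0, 1): -1.0,
               (1, 0): -1.0, (1, 1): 2.0, (1, 2): -1.0,
               (2, 1): -1.0, (2, 2): 2.0}
info = matrix_characteristics(3, tridiagonal)
```

Properties such as symmetry matter when choosing a method, since CG, for example, requires a symmetric positive definite matrix.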
The snapshots in Figure 1 show an example of the tool: the main menus,
the menu for selecting a sparse matrix from a file, the pattern
of this matrix, the help window and the window for selecting the number of
processors to execute the parallel code.

Figure 2 shows the performance consultant window with the
statistics for each process, a Kiviat graph showing the load balance, the
histogram of the number and length of the messages to be sent and received by
each processor, and the window for selecting the iterative method and
preconditioner.
Fig. 1. Example of use of the visualization tool.

Fig. 2. Example of some results shown by the visualization tool.

Abstract. We present a systematic and simple methodology to design
parallel algorithms to solve the Generalized Sylvester Equation and other
linear matrix equations. The resulting algorithms are well suited to be
implemented using standard libraries of matrix arithmetic routines.


1  Introduction. 

The solution of the Generalized Sylvester Equation, $AXB + CXD = E$, with
$A, C \in \mathbb{R}^{m \times m}$, $B, D \in \mathbb{R}^{n \times n}$ and $X, E \in \mathbb{R}^{m \times n}$, has wide application in modern
Linear Control Theory [7], [8], [13]. When addressing particular problems, simpler
equations derived from it, such as the Sylvester [8], [13], [2], Lyapunov [14] and Stein
[8], [13] equations, are also frequently used.

In this paper we introduce a systematic and simple design methodology to
solve them, deriving algorithms directly expressed in terms of basic operations
of Linear Algebra [4], so they can be easily implemented using standard scientific
libraries. The methodology is based on the definition of the Kronecker
Product, presented in section 2, and on the Back Substitution Algorithm. The
parallelization of this algorithm and of the basic operations of Linear Algebra has been
widely studied [4], [1], [15], so the methodology makes it possible to obtain
parallel implementations of the resulting algorithms systematically. The proposed
methodology has already been tested in the design of a library of systolic routines [10],
using dynamic arrays and applying the DBT transformation [12] to the basic
operations.

To  simplify  the  description  of  the  methodology  with  a  practical  example  in 
section  3,  we  will  assume  a  triangular  or  quasi  triangular  form  of  the  equation. 
Results  for  the  general  case  are  presented  in  [9], [11]. 

2  The  Kronecker  Product  and  the  Vector  Function. 

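Given the matrices $A = [a_{ij}] \in \mathbb{R}^{m \times m}$ and $B = [b_{ij}] \in \mathbb{R}^{n \times n}$, the Right
Kronecker Product of $A$ and $B$, written $A \otimes B$, is defined as the block
matrix

$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1m}B \\ a_{21}B & a_{22}B & \cdots & a_{2m}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mm}B \end{pmatrix} = [a_{ij}B] \in \mathbb{R}^{mn \times mn}.$$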


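Given $A \in \mathbb{R}^{m \times n}$, $A = (A_{:,1}, A_{:,2}, \ldots, A_{:,n})$ where $A_{:,i} \in \mathbb{R}^{m}$, with $i = 1,
2, \ldots, n$, the vector

$$\begin{pmatrix} A_{:,1} \\ A_{:,2} \\ \vdots \\ A_{:,n} \end{pmatrix} \in \mathbb{R}^{mn \times 1}$$

is called the Vec-function of $A$ and written $\mathrm{vec}(A)$. Among the properties of the
Vec-function, the following two [5]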

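1.  $\forall A, B \in \mathbb{R}^{m \times n}$ and $\forall \alpha, \beta \in \mathbb{R}$, $\mathrm{vec}(\alpha A + \beta B) = \alpha\,\mathrm{vec}(A) + \beta\,\mathrm{vec}(B)$

2.  If $A \in \mathbb{R}^{m \times m}$, $B \in \mathbb{R}^{n \times n}$ and $X \in \mathbb{R}^{m \times n}$, then $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$,

allow the Kronecker Product to be used as a tool to solve the studied matrix
equations and design the corresponding algorithms.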
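
Property 2 can be checked numerically on a small example. The Python sketch below uses plain nested lists; the helper names (`kron`, `vec`, `transpose`, `matmul`, `matvec`) are introduced here for illustration only.

```python
def kron(A, B):
    """Right Kronecker product: the block matrix [a_ij * B]."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def vec(M):
    """Stack the columns of M into a single vector."""
    return [M[r][c] for c in range(len(M[0])) for r in range(len(M))]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Hypothetical integer data with m = 2 and n = 3.
A = [[1, 2], [3, 4]]
X = [[1, 0, 2], [0, 1, 3]]
B = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]

left = vec(matmul(matmul(A, X), B))           # vec(AXB)
right = matvec(kron(transpose(B), A), vec(X))  # (B^T kron A) vec(X)
```

Both sides are vectors of length $mn = 6$, and with integer data they agree exactly.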

3  Application of the Kronecker Product and the
Vec-Function.

Before applying the methodology, the problem is transformed into
a condensed form according to the method proposed by Golub, Nash and Van
Loan [3]. Applying the previous definitions to the Triangular case of the equation
$AXB + CXD = E$, we obtain the linear system $(B^T \otimes A + D^T \otimes
C)\,\mathrm{vec}(X) = \mathrm{vec}(E)$, shown in figure 1. Its block structure suggests the use of
the Back Substitution Algorithm to solve the problem [6], but this implementation
has always been discarded due to the huge size of the resulting system.
Our methodology uses the Kronecker Product in the design phase to study the
structure of the resulting system, taking figure 1 as the starting point for the
design of the algorithms. Adapting the Back Substitution to the corresponding
block structure, we obtain the SGT Algorithm shown in figure 2, which uses
basic operations of Linear Algebra: Solve a system, Gaxpy and Saxpy.

For the Quasi-Triangular case, we assume that the pencil $A - \lambda C$ is reduced to
the Real Schur Form, and the pencil $D - \lambda B$ to the Triangular Form: each block
in figure 1 is a Schur matrix. The main difference is now that the matrix $(A b_{ii} +
C d_{ii})$ must be triangularized before solving for $x_i$. So, adding to the SGT
Algorithm the new operation Calculate $Q$: $(A b_{ii} + C d_{ii})Q$ is upper triangular,
whose outputs are the matrices $AQ$, $CQ$ (with the same zero-structure as
matrix $A$) and $Q$ (tridiagonal), we obtain the SGH Algorithm, shown in figure
3. A column-oriented transformation has been chosen to optimize
the data flow for a systolic implementation.
Fig. 1. Linear Equation System obtained by applying the Kronecker Product and the
Vec-function to the Triangular Generalized Sylvester Equation.


Back Substitution Algorithm:

for i:=n downto 1 do
  x[i]:=b[i]/a[i,i];
  for j:=i-1 downto 1 do
    b[j]:=b[j]-a[j,i]*x[i]
  endfor
endfor;

SGT Algorithm:

for i:=n downto 1 do
  Solve (A*b_ii + C*d_ii) x_i = e_i;
  w:=A*x_i;
  v:=C*x_i;
  for j:=i-1 downto 1 do
    Update e_j:=e_j - w*b_ij - v*d_ij
  endfor
endfor;


Fig.  2.  Transformation  of  the  Back  Substitution  Algorithm  into  the  SGT  Algorithm. 
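
The SGT scheme can be tried out with a short Python sketch. It assumes, in line with the update step $e_j := e_j - w\,b_{ij} - v\,d_{ij}$ for $j < i$, that $A$ and $C$ are upper triangular and that $B$ and $D$ are lower triangular; these assumptions and all function names are part of this illustration, not of the original routines.

```python
def back_substitute(U, y):
    """Solve U x = y for an upper triangular matrix U."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / U[i][i]
    return x


def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def matmul(P, Q):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Q)] for row in P]


def sgt_solve(A, B, C, D, E):
    """Solve A X B + C X D = E column by column (0-based indices).

    Assumes A, C upper triangular (m x m) and B, D lower triangular (n x n).
    """
    m, n = len(A), len(B)
    e = [[E[r][j] for r in range(m)] for j in range(n)]   # columns e_j
    x_cols = [None] * n
    for i in range(n - 1, -1, -1):
        # Solve (A b_ii + C d_ii) x_i = e_i, an upper triangular system.
        T = [[A[r][c] * B[i][i] + C[r][c] * D[i][i] for c in range(m)]
             for r in range(m)]
        x = back_substitute(T, e[i])
        x_cols[i] = x
        w = matvec(A, x)   # Gaxpy-type products
        v = matvec(C, x)
        for j in range(i - 1, -1, -1):
            # Update e_j := e_j - w b_ij - v d_ij (Saxpy-type updates).
            for r in range(m):
                e[j][r] -= w[r] * B[i][j] + v[r] * D[i][j]
    return [[x_cols[j][r] for j in range(n)] for r in range(m)]


# Hypothetical 2 x 2 data with a known solution X_true.
A = [[2.0, 1.0], [0.0, 3.0]]
C = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [2.0, 1.0]]
D = [[1.0, 0.0], [0.0, 2.0]]
X_true = [[1.0, 2.0], [3.0, 4.0]]
AXB = matmul(matmul(A, X_true), B)
CXD = matmul(matmul(C, X_true), D)
E = [[AXB[r][c] + CXD[r][c] for c in range(2)] for r in range(2)]
X = sgt_solve(A, B, C, D, E)
```

The large Kronecker system is never formed: each step only needs an $m \times m$ triangular solve and two matrix-vector products.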


for i:=n downto 1 do
  Calculate Q: (A*b_ii + C*d_ii)Q is upper triangular;
  Solve ((A*b_ii + C*d_ii)Q)(Q^T x_i) = e_i;
  w:=(AQ)*(Q^T x_i);
  v:=(CQ)*(Q^T x_i);
  x_i:=Q*(Q^T x_i);
  for j:=i-1 downto 1 do
    Update e_j:=e_j - w*b_ij - v*d_ij
  endfor
endfor;


Fig.  3.  SGH  Algorithm. 
3.1  Obtaining Block Algorithms.

Starting from figure 1 it is also possible to design algorithms for $N \times N$ upper
triangular blocks of size $M \times M$ each, with $N = pn$ and $M = qm$. Each of
these generic blocks, $A b_{ij} + C d_{ij}$, with $j = 1..N$ and $i = j..N$, presents the structure
shown in figure 4: they are built of $q \times q$ subblocks of size $m \times m$. Therefore
each of the columns of $X$ and $E$ is also built of $q$ blocks of size $m$. We will
denote the subblock at row $r$ and column $s$ of the $(A b_{ij} + C d_{ij})$ block
as $(A^{rs} b_{ij} + C^{rs} d_{ij})$, and the $r$th subvector of the $i$th column of $X$, $x_i$, or of $E$,
$e_i$, as $x_i^r$ or $e_i^r$, respectively. From the described block structure, it is possible to
obtain different algorithmic schemes. After studying several possibilities for the
Triangular case, we have chosen to rewrite block-oriented versions of the Solve,
Gaxpy and Update operations. The two resulting algorithms, shown in figure
5, are called SGTB2.1 (column-oriented) and SGTB2.2 (row-oriented).

Fig. 4. Structure of each triangular block from the coefficient matrix.


(a)

for i:=N downto 1 do
  for s:=q downto 1 do
    w^s:=0; v^s:=0
  endfor
  for s:=q downto 1 do
    Solve (A^ss*b_ii + C^ss*d_ii) x_i^s = e_i^s;
    for r:=s downto 1 do
      w^r:=w^r + A^rs*x_i^s;
      v^r:=v^r + C^rs*x_i^s
    endfor;
    for j:=i-1 downto 1 do
      Update e_j^s:=e_j^s - w^s*b_ij - v^s*d_ij
    endfor
  endfor
endfor;

(b)

for i:=N downto 1 do
  for s:=q downto 1 do
    w^s:=0; v^s:=0;
    for r:=q downto s+1 do
      w^s:=w^s + A^sr*x_i^r;
      v^s:=v^s + C^sr*x_i^r
    endfor;
    e_i^s:=e_i^s - w^s*b_ii - v^s*d_ii;
    Solve (A^ss*b_ii + C^ss*d_ii) x_i^s = e_i^s;
    w^s:=w^s + A^ss*x_i^s;
    v^s:=v^s + C^ss*x_i^s;
    for j:=i-1 downto 1 do
      Update e_j^s:=e_j^s - w^s*b_ij - v^s*d_ij
    endfor
  endfor
endfor


Fig.  5.  (a)  SGTB2.1  Algorithm,  (b)  SGTB2.2  Algorithm. 
The main difference between the Triangular and the Quasi-Triangular cases
lies in the division of each Schur block to obtain a block-oriented version of
the operation Calculate Q. We must apply the division depicted in figure 6: two
consecutive blocks in a row, $(A^{rs} b_{ii} + C^{rs} d_{ii})$ and $(A^{r,s+1} b_{ii} + C^{r,s+1} d_{ii})$, share a
column, so that the subdiagonal elements are correctly nullified. The resulting SGHB1.1
and SGHB1.2 algorithms are shown in figure 7. The blocks affected by this
special division are marked in bold type. Note that although the real size of
each submatrix $((A^{ss} b_{ii} + C^{ss} d_{ii})Q^s)$ in the operation Solve is $m \times (m + 1)$,
the result $(Q^s)^T x_i^s$ must be of size $m$. Therefore some updates are deferred until
the corresponding element is calculated.


Fig. 6. (a) Block division for Solve and Gaxpy operations, (b) Block division for
Calculate Q and Apply Q operations.


4  Conclusions 

We  have  presented  a  methodology  that  allows  the  simplification  of  a  complex 
problem  to  be  solved  using  basic  Linear  Algebra  operations  and  implementing 
the  solution  using  standard  libraries,  and  that  can  also  be  used  to  obtain  block 
algorithms.  The  methodology  itself  is  a  powerful  graphical  tool  that  helps  the 
design  by  offering  a  clear  representation  of  the  data  flow  and  dependencies. 
Therefore  the  data  flow  is  adapted  to  the  processing  requirements,  eliminating 
the  need  for  intermediate  storage  resources.  The  methodology  has  been  applied 
to  other  equations  [9]  obtaining  a  reduced  set  of  basic  arrays  that  form  a  complete 
Systolic  Library  [10]  for  solving  a  wide  variety  of  problems  in  the  field  of  matrix 
algebra  (see,  e.g.,  [11]). 
Abstract. This paper describes the design and implementation of dthread,
a new general purpose user-level threads package, designed to support
fine-grain parallel applications in a portable and efficient way. We decided
to build this new library because the performance of the Solaris
threads library is not good enough to support fine-grain parallel applications.
We include some measurements comparing the performance of
both libraries. They show our objective has been reached.

Topics:  Parallel  and  distributed  algorithms,  Operating  systems. 
Keywords:  Threads,  Parallelism,  Solaris,  Multiprocessors. 


1  Introduction 

This  work  is  part  of  a  large  project:  VAMOS,  “VHDL  Advanced  Multiprocessor 
Optimized  Simulation”,  developed  by  the  Computer  Architecture  Department 
(UPM)  and  the  TGI  company.  The  objective  of  VAMOS  was  to  develop  a  VHDL 
parallel  simulator  for  shared  memory  multiprocessors.  This  parallel  simulator 
runs  on  Solaris  multiprocessors  and  uses  fine  and  very  fine-grain  parallelism. 

Due to the poor performance we observed in the Solaris threads library with
this kind of parallelism, we decided to develop our own threads library to improve
the performance. The result is a small, efficient, portable and standard threads
library suitable for fine-grain parallelism.


2  State  of  the  art 

Initially threads were lightweight processes executing in a single address space
that could run independently and concurrently. They were managed in the
operating system kernel (kernel threads), which made threads expensive.

Later on, user-level threads were introduced ([1,2]). They have performance
and flexibility advantages over kernel threads because they are managed within
the user address space. But they also have disadvantages when a user-level thread
performs blocking system calls or in the presence of multiprogramming. These
problems arise because there are two places where the next running thread can be
scheduled, one in the application and another one in the operating system, with
very little coordination between them. This is called the two-level scheduling
problem ([3-7]).

At present, threads are supported by most operating systems,
including Solaris. They are well known, and there is a standard ([8]) that ensures
their portability.


3  Solaris  threads  library 

Solaris has two kinds of threads: kernel-supported threads, called Light Weight
Processes (LWPs), and user-level threads, simply called threads. User-level threads
are used to decrease the overheads involved in their management (creation,
destruction, context switch, ...). On the other hand, Solaris uses kernel
threads as virtual processors to execute the user-level threads and to control the
degree of real concurrency that the application requires.

Each LWP is independently dispatched by the kernel. LWPs may run in parallel
on a multiprocessor, being scheduled onto the available processors according
to their scheduling class and priority. Threads are implemented by the library
and are not known to the kernel.

We  have  found  several  problems  in  the  Solaris  user-level  threads  library  that 
encouraged  us  to  develop  a  new  one: 

-  Heavy weight. User-level threads are quite heavy, being suitable for
coarse-grain and middle-grain applications but never for fine-grain applications,
because a context switch involves heavy system calls that make the context switch
time usually longer than the task execution time.

-  Degree of concurrency. The library changes dynamically and transparently
the number of LWPs that support an application, in order to solve the
two-level scheduling problem. When all the LWPs in the process are blocked
in indefinite waits, the kernel sends a signal to the threads library, which
responds by creating a new LWP. The threads library also makes LWPs "age"
and, if they are not used for a long time, they are terminated. That means
the user has only loose control over the actual degree of concurrency, which is
effective only in simple and small applications; in a real,
big application the user has no control.

-  Poor locality of reference. The library puts all the runnable threads
together in a global queue. The LWPs always choose the first one, which
implies poor behavior in terms of locality of reference.

-  Bad optimizations. Some library optimizations are very dependent on
the application and quite often they decrease the performance instead of
increasing it. For example, when a thread becomes blocked and there are
no more runnable threads, the LWP that was running the thread must also
stop running. It does so by waiting on an LWP semaphore associated with
the thread (the LWP is parked), rather than idling on the global condition
variable. This practice optimizes the case where the blocked thread becomes
runnable quickly, but leaves the application without one LWP for some time.
It is even possible to find some runnable threads waiting for an LWP while
there are some parked LWPs.


4  Dthreads  library 

Once  we  identified  the  above  mentioned  problems,  we  concluded  that  the  Solaris 
threads  library  was  not  appropriate  for  fine-grain  parallelism.  The  best  choice 
was  to  replace  it  with  a  new  one,  suitable  for  fine-grain  parallelism.  The  principal 
goals  were: 

- Efficiency. This was the main objective. To accomplish it we reduced the
thread weight to the minimum.

- Thread management is done exclusively in the user address space, without
system calls.

- Portability. The dthreads library is POSIX compliant to ensure portability.
The VHDL simulator can use either dthreads or the Solaris threads library
simply by linking against the chosen one.

- Degree of concurrency. The dthreads library leaves control of the degree
of concurrency that the application needs to the user, and never changes it
without the user's knowledge.

- Locality of reference. The new library tries to avoid thread migration
between LWPs, improving cache efficiency. To achieve this it keeps one local
queue per processor in addition to a global queue.

- Avoid blocking system calls as much as possible. If a user-level thread
executes a blocking system call, the underlying kernel thread blocks too.
Inside the application a virtual processor is lost, and a physical one can be
left unused even if there are runnable user-level threads. To avoid idle
processors (the two-level scheduling problem), the library implements a
buffered input/output monitor. This monitor could also have been implemented
with the Solaris threads library. The idea is to ensure that most of
the time the application threads will not block: the read and write
operations are done by the monitor.
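
The locality goal above (one local queue per processor plus a global queue) can be illustrated with a minimal Python sketch. It is not the dthreads implementation: the class and method names are hypothetical, and the rule that new threads start in the global queue is an assumption made only for the example.

```python
from collections import deque


class RunQueues:
    """Toy run-queue layout: one local queue per processor plus a global one."""

    def __init__(self, n_processors):
        self.local = [deque() for _ in range(n_processors)]
        self.global_queue = deque()

    def make_runnable(self, thread, last_processor=None):
        # A thread that has already run goes back to the queue of the
        # processor where it last ran, so it can reuse its cached state.
        # Assumption: a thread that has never run goes to the global queue.
        if last_processor is None:
            self.global_queue.append(thread)
        else:
            self.local[last_processor].append(thread)

    def pick_next(self, processor):
        # Prefer local work (warm cache), then fall back to the global queue.
        if self.local[processor]:
            return self.local[processor].popleft()
        if self.global_queue:
            return self.global_queue.popleft()
        return None
```

With this layout a thread migrates only when it is taken from the global queue, which is the behaviour the locality goal asks for.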

The architecture of the system based on dthreads is quite similar to the
original one. The new user-level threads execute on the LWPs, which are used as
virtual processors.

Each  thread  has  a  stack,  an  optional  heap  and  a  thread  control  block.  The 
thread  control  block  holds  the  thread  identifier,  a  pointer  to  the  stack  and  the 
thread  context. 

A heap can be assigned to each thread to avoid contention in dynamic memory
allocation for memory-intensive applications. Threads can share memory
dynamically allocated in any heap, but each thread must manage its own heap,
allocating and freeing memory. If a thread doesn't know which thread will
free a memory block, it must use the global heap, which is lock protected to
prevent concurrent access. There is no loss of generality because local heaps
are only an extension: the global heap is always present and access to it is
protected.


5  Results 

Now  we  will  show  some  measurements  that  justify  the  advantages  of  this  new 
user-level  threads  library.  The  performance  of  this  library  is  compared  with  the 
performance  of  the  Solaris  library  to  show  the  differences. 

These measurements were taken on a four-processor Sun SPARCstation 20,
based on 50 MHz SuperSPARC processors with 128 Mbytes of shared memory.


Operation        Solaris    dthreads   Speedup
Create/Destroy   2900 µs    600 µs     4.83
Lock/Unlock      1.7 µs     0.68 µs    2.5
GetSpecific      1.1 µs     0.53 µs    2.1
pthread_self     0.4 µs     0.58 µs    0.69

Table 1. Performance.


Table 1 shows the time spent on several important operations in both dthreads
and Solaris threads. However, these data alone are not enough to justify the
development of a new user-level threads library. Tables 2 and 3 show the time
taken by a context switch with pthread_yield and with condition variables.
Here the differences between the two libraries are substantial.
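
As a quick check of how the speedup column of Table 1 is obtained, the short Python sketch below divides the Solaris time by the dthreads time for each operation. The dictionary simply restates the table values in microseconds; the names `TABLE_1` and `speedup` are introduced here only for illustration.

```python
# Times in microseconds, copied from Table 1: (Solaris, dthreads).
TABLE_1 = {
    "Create/Destroy": (2900.0, 600.0),
    "Lock/Unlock": (1.7, 0.68),
    "GetSpecific": (1.1, 0.53),
    "pthread_self": (0.4, 0.58),
}


def speedup(solaris_us, dthreads_us):
    """Speedup of dthreads over Solaris threads (a value above 1 means faster)."""
    return solaris_us / dthreads_us


for operation, (solaris_us, dthreads_us) in TABLE_1.items():
    print(f"{operation:15s} {speedup(solaris_us, dthreads_us):.2f}")
```

A value below 1, as for pthread_self, marks the one operation in the table where dthreads is slower.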


Context switches between user-level threads can occur very often in a parallel
application and can introduce significant overhead in the Solaris threads
library. With a single LWP, that is, without actual concurrency, Solaris thread
management is done inside the user address space, without system calls,
and with reasonable times. However, in the usual situation, with several
user-level threads over a few LWPs, Solaris introduces many costly system
calls. Thread management is no longer done in the user address space,
and many unnecessary system calls can appear (lwp_mutex_lock,
lwp_mutex_wakeup, lwp_sema_post, lwp_sema_wait). The dthreads library never
uses system calls to manage user-level threads, which explains
the differences in execution time.


Concurrency          Solaris     dthreads   Speedup
1 thread, 1 LWP      16.3 µs     10.3 µs    1.6
2 threads, 1 LWP     24.7 µs     10.3 µs    2.4
4 threads, 1 LWP     25.1 µs     10.6 µs    2.4
2 threads, 2 LWPs    7.2 µs      6.0 µs     1.2
4 threads, 2 LWPs    79.0 µs     6.4 µs     12.3

Table 2. Context switch with pthread_yield.


Concurrency          Solaris     dthreads   Speedup
1 thread, 1 LWP      4.4 µs      2.3 µs     1.9
2 threads, 1 LWP     38.4 µs     13.3 µs    2.9
4 threads, 1 LWP     38.9 µs     13.4 µs    2.9
2 threads, 2 LWPs    103.8 µs    18.0 µs    5.8
4 threads, 2 LWPs    121.4 µs    15.8 µs    7.7

Table 3. Context switch with conditions.


There are some other reasons to build the dthreads library:


- To control the degree of concurrency. This library never changes the number
of kernel threads that support the application unless the user asks
for it.

- To improve locality of reference. Solaris uses a single global queue for
all the runnable threads. This solution gives an optimal load balance but
poor locality of reference, because a thread does not reuse the state it left
in the cache memory of the last processor where it ran. In contrast, the
dthreads library has local queues and puts each thread in the local runnable
queue of the last processor where it ran.

- To increase the limit on the number of threads. The dthreads library
implements threads with a small state, which reduces the resources used and
makes it possible to manage a larger number of threads.

6  Conclusions 

As has been shown, we have met our initial objectives. We have obtained a
general-purpose user-level threads library for shared-memory multiprocessors
that is POSIX compliant. It is efficient, small and portable, and it is
suitable for fine-grain parallelism. It is faster than the Solaris user-level
threads library in all the measured operations except pthread_self, and it
solves the problems identified in the Solaris threads library.


Abstract. In this paper a parallel algorithm that allows an efficient
simulation of a laser cavity is presented. Implementing the algorithm optimally
on a distributed-memory multicomputer requires the choice of an optimal grain
size. This grain size must balance different factors that depend on the
parameters of the calculation. A model for the optimal choice of the grain size
is proposed along with the corresponding experimental tests. The theoretical
model can be easily extrapolated to a great number of similar problems.


Related  topics:  Parallel  and  distributed  algorithms. 


1  Purpose  and  scope  of  the  work 

The parallel algorithm studied here simulates the physical behaviour of a
laser cavity. Laser (Light Amplification by Stimulated Emission of Radiation)
light has particular optical properties such as high spatial and temporal
coherence. An optical amplifier medium and two mirrors can produce laser light;
together they make up a laser cavity like the one shown in fig. 1. See [1] for
a practical application of this technology and [2] for a reference on the
underlying physical problem.

The physical behaviour of the laser cavity can be simulated with a computer. In
a semiclassical model, five functions are needed to fully describe the state of
the cavity at a given time t. Three of them describe the matter, p(x), q(x) and
w(x), and the other two, a(x) and da/dt(x), describe the radiation. The temporal
evolution of the cavity state obeys the partial differential equations (1).

Fig. 1. Laser cavity diagram (top) and its discretization (bottom).


If the space and time dimensions are discretized, one arrives at five arrays
that describe the state of the cavity (fig. 1, bottom) and five equations of
temporal evolution for them (2-6).


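$$ p(x, t+\Delta t) = p(x,t) + \frac{\partial p}{\partial t}(x,t)\,\Delta t + \frac{1}{2!}\frac{\partial^2 p}{\partial t^2}(x,t)\,(\Delta t)^2 + \frac{1}{3!}\frac{\partial^3 p}{\partial t^3}(x,t)\,(\Delta t)^3 + \cdots \tag{2} $$

$$ q(x, t+\Delta t) = q(x,t) + \frac{\partial q}{\partial t}(x,t)\,\Delta t + \frac{1}{2!}\frac{\partial^2 q}{\partial t^2}(x,t)\,(\Delta t)^2 + \frac{1}{3!}\frac{\partial^3 q}{\partial t^3}(x,t)\,(\Delta t)^3 + \cdots \tag{3} $$

$$ w(x, t+\Delta t) = w(x,t) + \frac{\partial w}{\partial t}(x,t)\,\Delta t + \frac{1}{2!}\frac{\partial^2 w}{\partial t^2}(x,t)\,(\Delta t)^2 + \frac{1}{3!}\frac{\partial^3 w}{\partial t^3}(x,t)\,(\Delta t)^3 + \cdots \tag{4} $$

$$ a(x, t+\Delta t) = a(x,t) + \frac{\partial a}{\partial t}(x,t)\,\Delta t + \frac{1}{2!}\frac{\partial^2 a}{\partial t^2}(x,t)\,(\Delta t)^2 + \frac{1}{3!}\frac{\partial^3 a}{\partial t^3}(x,t)\,(\Delta t)^3 + \cdots \tag{5} $$

$$ \frac{\partial a}{\partial t}(x, t+\Delta t) = \frac{\partial a}{\partial t}(x,t) + \frac{\partial^2 a}{\partial t^2}(x,t)\,\Delta t + \frac{1}{2!}\frac{\partial^3 a}{\partial t^3}(x,t)\,(\Delta t)^2 + \frac{1}{3!}\frac{\partial^4 a}{\partial t^4}(x,t)\,(\Delta t)^3 + \cdots \tag{6} $$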


Note that boundary conditions must also be taken into account: the value of the
arrays at either end of the cavity (i.e. where the mirrors lie) is zero for all
t. With these relations, it is straightforward to devise a sequential algorithm
for simulating the cavity behaviour.

Fig. 2. Spatial dependencies for the calculation of the state of one point in
the cavity at the next time step. Alfa and dalfa are the names of the arrays
used in the program for a and da/dt.
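
One step of such a sequential algorithm can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `taylor_step` is hypothetical, the time derivatives are passed in as ready-made arrays (in the real simulation they would come from the partial differential equations (1), which are not reproduced here), and the series is truncated after the derivatives supplied.

```python
def taylor_step(values, derivs, dt):
    """Advance one state array by dt with a truncated Taylor series.

    values: list with the current value at every mesh point.
    derivs: list of lists; derivs[k][i] is the (k+1)-th time derivative
            at mesh point i (hypothetical inputs, computed elsewhere).
    """
    new_values = []
    for i, current in enumerate(values):
        total = current
        factorial = 1.0
        power = 1.0
        for order, derivative in enumerate(derivs, start=1):
            factorial *= order          # order!
            power *= dt                 # dt ** order
            total += derivative[i] * power / factorial
        new_values.append(total)
    # Boundary condition: the arrays are zero where the mirrors are.
    new_values[0] = 0.0
    new_values[-1] = 0.0
    return new_values
```

The same routine would be applied to each of the five arrays at every time step.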


When designing a parallel algorithm to simulate the physical behaviour of the
laser cavity, one must consider the spatial dependencies involved in calculating
the value of the parameters at one point for the next time step. Fig. 2 shows
these dependencies for a given point in the cavity.

A first approach to the parallel implementation is a "divide-and-conquer"
technique: the cavity points are divided in equal parts among the processors
that make up the multicomputer. Of course, because of the spatial dependencies,
this partition must include the overlapping boundary points needed for the
calculations. There should be at least two overlapping points (eq. 2-6). The
temporal evolution of the solution forces a synchronization, and the
corresponding communication, to exchange boundary points between neighbouring
processors (fig. 3, left).

Some initial studies showed that, depending on the cavity size and the number
of processors used, this simple calculation plan could give low performance
because of the communication penalty introduced in the parallel algorithm. In
other words, the grain size is too small for the work environment used.

Fig. 3. Evolution diagram of the parallel algorithm with two processors (see text).

Thus, a first objective is to design a new algorithm that allows different
sizes for the overlapping zone between processors, so that the grain size
can be increased to allow higher speedups. A working diagram of the new
algorithm is presented in fig. 3, right.
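
The partition of the cavity with an adjustable overlapping zone can be sketched in Python as below. The function name `partition` and the convention that the overlap is counted per side of each block are assumptions made for this example; the paper does not give its exact indexing.

```python
def partition(n_points, n_procs, overlap):
    """Split mesh indices 0..n_points-1 among n_procs processors.

    Returns one tuple per processor:
        (owned_start, owned_end, halo_start, halo_end)
    All ranges are half-open. The halo range is the owned range extended
    by `overlap` points on each side (assumption), clipped at the mirrors.
    """
    base, extra = divmod(n_points, n_procs)
    blocks = []
    start = 0
    for rank in range(n_procs):
        # The first `extra` processors get one additional point.
        size = base + (1 if rank < extra else 0)
        end = start + size
        halo_start = max(0, start - overlap)
        halo_end = min(n_points, end + overlap)
        blocks.append((start, end, halo_start, halo_end))
        start = end
    return blocks
```

A larger overlap means more points held (and recomputed) by two processors, but fewer exchanges of boundary points between neighbours.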

A second objective is to find a method for choosing the optimal grain size, the
one that gives the best possible speedup as a function of the number of
processors and the cavity size of the problem to be solved.


2  Fundamental  Results  Already  Obtained 


2.1  Hardware/Software  Configuration 

To do the parallel calculation, a PC network has been used. Each node has an
Intel Pentium II processor at 266 MHz and 64 MB of RAM. Linux was used as the
operating system. For communication, each computer has a Fast Ethernet card
connected to a switch. The message-passing software used is MPI, in its LAM/MPI
version 6.3.b2 implementation [3].

Fig. 4. Left: communication time vs. overlapping for a 128000-point cavity and
16 processors. Right: maximum reachable speedup due to the algorithm overhead
versus overlapping.


2.2  Grain  Size  Behaviour  Issues 

The new algorithm presented in Section 1 introduces several factors that
influence the final speedup in different ways. As the size of the overlapping
zone increases:

1. The amount of information to be transmitted increases. An example of the time
spent in communication vs. the size of the overlapping zone is presented in
fig. 4, left.

2. The grain size grows, which is positive for the final speedup.

3. The parallel algorithm overhead increases, that is, some mesh points are
calculated twice by different processors (fig. 3). This decreases the speedup
and sets a maximum reachable speedup as a function of the size of the
overlapping zone. This maximum can be deduced mathematically and is plotted for
an example in fig. 4, right.

As stated before, we now propose a theoretical model that predicts the best
overlapping size for a given cavity size and number of processors. Let us define
a cycle as the time between successive intercommunications. The speedup can then
be defined as:


$$ S = \frac{t_s}{t_p} = \frac{t_s}{t_p^{(calc)} + t_p^{(com)}} \tag{7} $$

where $t_s$ stands for sequential time, $t_p$ means parallel time,
$t_p^{(calc)}$ is the time spent in calculation by the parallel algorithm and
$t_p^{(com)}$ is the time spent in communication.

Considering a linear dependency of communication time on the overlapping size
(as fig. 4, left, suggests), one can arrive at the following expression for the
speedup:

$$ S = \frac{\tau\,(NP-2)\,\sigma/\ldots}{2(a+b\sigma) + \tau\left[\dfrac{NP-2}{N} - 1 + \sigma/\ldots\right]} \tag{8} $$


where τ is the time spent on the calculation of one point in the cavity, NP is
the number of cavity points, σ is the overlapping size, a + bσ is the linear
dependence of the communication time on σ, and N is the number of processors.


The overlapping size that maximizes (8) can be deduced from it and is:


(9) 


2.3  Experimental  results 

Several experimental tests have been carried out to verify the theoretical
model presented above. There is good agreement between the calculated optimal
overlapping and the experimental one in the cases we have analyzed. Some of
these experimental results are shown in the following table:


# processors   Cavity size   Calculated optimal overlapping   Experimental optimal overlapping
16             128000        37.2                             34
16             40000         22.6                             24

3  Conclusions 

A parallel algorithm for a laser cavity simulation has been developed. The
algorithm seeks an optimal speedup by selecting a grain size that balances
calculation against communication. The optimal grain size is chosen by means
of a very simple, easy-to-calculate theoretical prediction which, additionally,
could be extrapolated in the future to similar algorithms for simulating the
space/time evolution of physical systems.


Abstract. This paper outlines key issues that must be addressed in
order to allow PVM-based programs to make effective use of resources
within a wide area network-computing environment. Support mechanisms
that allow unmodified PVM programs to be used within the
PUNCH network-computing environment are also described. The mechanisms
were found to be easy to implement, and preliminary experiences
indicate that the described approach is well-suited for a
network-computing environment.


1  Introduction 

Distributed applications are often built on top of message-passing standards such
as PVM [1] and MPI [2]. These standards were originally designed for relatively
structured environments, where users are aware of all available machines and
have direct access to them. In this context, the emerging wide area
network-computing environment presents two interesting challenges: 1) the large
size of the environment makes it difficult for users to keep track of all
available resources, and 2) the dynamic and inter-institutional nature of the
environment causes logistical problems when users are required to have actual
user-accounts on all resources.

This paper describes how PVM- and MPI-based programs can make effective
use of resources within a wide area network-computing environment by
leveraging the functionality provided by PUNCH, the Purdue University
Network-Computing Hubs. A unique aspect of the described implementation is that
neither the PVM/MPI programs nor the PVM/MPI libraries need to be modified.

PUNCH [3,4] is a distributed network-computing infrastructure that provides
transparent and universal access to remote programs and resources via the
World Wide Web. PUNCH users can define simulations, run them, and view
the text and graphical output, all via their Web browsers. PUNCH currently
provides access to more than fifty engineering software packages developed by
thirteen universities and six vendors; a new program can be added in as little
as thirty minutes. PUNCH can be accessed at www.ece.purdue.edu/punch.

The discussion in this paper focuses on PVM-based programs, but the ideas
are equally applicable to MPI programs. The remaining sections are organized as
follows. Section 2 outlines the issues that arise in the process of running a
PVM program. Section 3 describes the support mechanisms provided by PUNCH for
running PVM programs in a network-computing environment. Section 4 outlines
related work. Finally, Section 5 presents concluding remarks and
directions for future work.


2  Issues  in  Running  PVM  Programs 

Running a PVM program in an environment where a user has direct access to
all machines typically involves the following steps. The user must first select
the machines for the given run and choose a "master" machine. Next, he/she must
log in to each "slave" machine and create a .rhost file that will allow PVM to
start processes on that machine. After this, the user must create a PVM host
file on the master machine; this file provides PVM with information about the
available machines. Once this is done, the user must start the PVM daemons
by invoking the PVM console on the master machine. At this point, the PVM
system has been initialized and the user can start the PVM program.

In  an  environment  where  users  are  not  aware  of  all  available  resources,  the 
steps  described  above  must  be  automated.  In  situations  where  users  do  not  have 
user-accounts  on  all  machines,  operating  system  support  for  "scratch"  accounts 
must  be  provided.  The  resulting  sequence  of  steps  required  to  start  a  PVM 
program  in  a  wide  area  network-computing  environment  is  illustrated  in  Figure  1. 


3  The  PUNCH  Approach 


PUNCH users initiate programs via a dynamically-generated Web interface that
is accessible from standard WWW browsers [4]. For PVM- and MPI-based
programs, users explicitly specify the number and types of machines required for
a given run, in addition to other input parameters required by the program. This
information is typically provided via menus and text-boxes in HTML forms.

When a user attempts to initiate a PVM- or MPI-based program, PUNCH
first allocates the necessary resources using the user-supplied information about
the number and type of machines. With reference to Figure 1, resource allocation
(step 1 in the figure) involves two tasks: 1) selecting appropriate machines for
the given run, and 2) ensuring that a scratch account is available for use on each
of the selected machines.


The process of allocating resources is handled by PUNCH's pipelined resource
management system, and proceeds as follows (see Figure 2). PUNCH first
forwards the user-supplied information about machines to a local query manager.
The query manager decomposes this information into individual components,
each of which consists of a set of constraints (e.g., architecture, memory, need
for scratch account, etc.) and a quantity (i.e., number of machines of this type).
For example, a request for three Sun and four HP servers will be decomposed
into two components, one for the three Suns and one for the four HPs.

1. Requirements specified by user; resources consist of machines and scratch
   accounts.
2. Select "master" machine and create ".rhost" files for access management.
3. Make allocated machines and resources available to PVM.
4. Set up PVM by automatically invoking the PVM console.
5. Program may be batch or interactive.
6. Wait for PVM program to complete.
7. Retrieve files, remove rhost access, and return allocated machines and
   scratch accounts to the resource pool.

Fig. 1. The sequence of steps required to run a PVM program in a wide area
network-computing environment. It is assumed that users are not aware of all
available resources, and that they do not have user-accounts on all usable
machines.

The  individual  components  are  then  forwarded  to  the  nearest  (in  terms  of 
network  reachability)  pool  manager(s),  where  they  are  processed  concurrently.  (If 
a  pool  manager  is  unable  to  satisfy  a  request,  the  query  manager  will  forward  the 
request  to  the  next  nearest  pool  manager.)  The  pool  manager  uses  the  constraints 
contained  within  a  given  query  component  to  map  it  to  an  appropriate  resource 
pool. A resource pool consists of 1) all machines in a specified local domain that
satisfy a given set of constraints, and 2) scheduling agents that select machines
from those within the pool on the basis of performance-related criteria (e.g.,
load balancing). For example, one pool could contain all Sun machines, another
could contain all HP machines, a third could contain all Sun machines with
at least 128MB memory, and so on. After a resource manager maps a query
component to a resource pool, the scheduling agents associated with that pool
allocate the desired number of machines and forward relevant information to
another query manager stage (not shown in Figure 2; this stage is only required
for queries that have to be decomposed into multiple components). The query
manager reassembles the individual query components and forwards the results
to PUNCH.

Fig. 2. The resource management pipeline utilized by PUNCH. A resource pool consists
of dynamically aggregated resources that are similar in terms of a specified set of
constraints and the associated scheduling agents.
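
The decomposition and allocation steps of the pipeline can be illustrated with a small Python sketch. It is not PUNCH code: `Component`, `decompose` and `allocate` are hypothetical names, each pool manager is modelled as a plain dictionary from a constraint set to a list of free hosts, and the scheduling agent is reduced to "take the first free machines", all of which are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    constraints: frozenset  # e.g. frozenset({("arch", "sun")})
    quantity: int           # number of machines of this type


def decompose(request):
    """Split a request such as {'sun': 3, 'hp': 4} into one component per type."""
    return [Component(frozenset({("arch", arch)}), count)
            for arch, count in request.items()]


def allocate(component, pool_managers):
    """Try the pool managers in order (assumed sorted nearest first).

    Each manager maps a constraint set to the list of free hosts in the
    matching pool. Returns the chosen hosts, or None if no manager can
    satisfy the component.
    """
    for manager in pool_managers:
        free_hosts = manager.get(component.constraints, [])
        if len(free_hosts) >= component.quantity:
            chosen = free_hosts[:component.quantity]
            del free_hosts[:component.quantity]  # hosts are now in use
            return chosen
    return None
```

For the request in the text, `decompose({"sun": 3, "hp": 4})` yields two components, and `allocate` falls through to the next nearest manager when the first one cannot satisfy a component.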


A key feature of the resource management pipeline is that the resource pools
are created dynamically from site-specific databases. Pool managers create
resource pools when they are required and automatically destroy pools that have
been inactive. This mechanism allows the resource management system to
dynamically minimize the scheduling overhead for the specific types of jobs that
are being initiated at any given time. Another benefit comes from the manner in
which pool managers are chosen: because “closer” managers are selected first,
allocated machines tend to be near each other. (In the current implementation,
the closeness between managers is defined by a static quantity and machines
controlled by the same manager are assumed to be at zero distance.)

Once the necessary resources have been allocated, the host environment is
initialized as follows (step 2 in Figure 1). PUNCH first selects the “master” for
this run; the first allocated machine is arbitrarily chosen for this role. Next,
PUNCH uses secure shell (SSH [5]) to log in to the allocated scratch account on
each of the remaining allocated machines and creates the necessary .rhost files.
If secure shell is not available on a given machine, rsh or rexec can be used
instead. The key advantage of the PUNCH approach with respect to this step is
that neither PVM nor the user needs to be given access to the passwords for the
scratch accounts. This allows PUNCH to recycle scratch accounts among users
in a secure manner.

The  third  step  involves  generating  a  PVM  host  file  that  contains  the  names 
of  the  machines  allocated  for  this  run  and  the  login  names  for  the  corresponding 
scratch  accounts.  PUNCH  uses  secure  shell  to  access  the  scratch  account  on  the 
“master”  machine  and  writes  the  appropriate  information  into  a  new  file. 
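
The content of such a host file can be sketched in Python as follows. The function name `pvm_hostfile` and the host and account names in the usage note are hypothetical; the sketch assumes the PVM host file convention of one host per line, with `#` starting a comment and the `lo=` option giving the login name to use on that host.

```python
def pvm_hostfile(allocations):
    """Build the text of a PVM host file.

    allocations: list of (hostname, scratch_login) pairs, one per
    allocated machine (hypothetical input format).
    """
    lines = ["# host file generated for this run"]
    for hostname, scratch_login in allocations:
        # lo= tells PVM which login name to use on that host.
        lines.append(f"{hostname} lo={scratch_login}")
    return "\n".join(lines) + "\n"
```

For example, `pvm_hostfile([("node1.example.org", "scratch07")])` produces a comment line followed by `node1.example.org lo=scratch07`.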

The fourth step involves starting PVM daemons on all allocated machines.
PUNCH accomplishes this by invoking the PVM console in the scratch account
on the master machine; the daemons on the slave machines are automatically
started by PVM (via the information in the PVM host file) when the console is
invoked.

Once the PVM daemons are running, PUNCH copies the necessary data
files into the scratch account on the “master” machine (from the user’s PUNCH
account) and initiates the PVM-based program (step 5 in the figure). If the
program is designed to run in batch mode, it is started in the background;
otherwise, it is started within an X session that is accessible by the user via
his/her browser [4]. A unique feature of the PUNCH approach is that it is a
non-intrusive solution: the PVM system and the PVM-based program are
completely unaware of PUNCH. One advantage of this is that PUNCH can
support unmodified (i.e., legacy) PVM-based programs as long as they do not
use hard-coded machine names. (This limitation can be removed by trapping
rsh calls from PVM and modifying them; see [6] for details.) Another benefit
is that PUNCH does not affect the performance of the programs, except to the
extent that it makes resource allocation decisions.

At  this  point,  PUNCH  simply  waits  for  the  PVM-based  program  to  complete 
(step  6  in  Figure  1).  When  this  happens,  PUNCH  will  first  retrieve  output  files 
from  the  scratch  accounts  and  place  them  in  the  user’s  PUNCH  account.  Then, 
PUNCH  will  stop  the  PVM  daemons  (via  the  PVM  console),  terminate  any 
active  processes  within  the  allocated  scratch  accounts,  and  remove  the  .rhost 
files. Once PUNCH has verified that the scratch accounts are “clean” (i.e., empty
and with no active processes), they will be returned to the account pool.

4  Related  Work 

MPICH-G [7] is a grid-enabled implementation of MPI that uses services
provided by the Globus toolkit [8] to allow users to run MPI programs within a
wide area network-computing environment. This work makes existing MPI-based
programs usable in a network-computing environment by enhancing the capabilities
of MPI itself, whereas the PUNCH approach provides support mechanisms that
work with unmodified implementations of PVM/MPI. Another difference
between the two approaches is that MPICH-G requires users to have user-accounts
on all machines that might be used to run the MPI program, whereas PUNCH
uses scratch accounts to work around this problem. (PUNCH provides
administrators with a way to specify usage policies so that only authorized users
are given access to machines.)

Legion  [9]  allows  PVM  programs  to  run  in  the  Legion  network-computing 
environment  by  emulating  the  PVM  API  on  top  of  the  Legion  run-time  system. 
This  approach  is  fairly  complex  from  an  implementation  standpoint,  and  does 
not  support  the  complete  PVM  API  [10]. 

Finally, Condor [11] provides support for PVM programs that are based on
the master-worker paradigm. One issue that arises in Condor's opportunistic
computing environment is that the "master" process must be able to handle the
disappearance of worker nodes; the "master" process can compensate for lost
nodes by (dynamically) requesting additional machines.


5  Conclusions 

A prototype version of PUNCH that allows users to run unmodified PVM-based
programs in a wide area network-computing environment has been implemented
and tested. Preliminary results show that the described approach efficiently
manages available resources. Support for MPI-based programs is being added; this
is a relatively simple extension of the work described here.

The implementation described in this paper does not provide support for
dynamically increasing or decreasing the number of machines available to a
running PVM program. Future work will be aimed at adapting the type and number
of machines available to a PVM program on the basis of observed and predicted
performance characteristics.

Acknowledgements 

This work was partially funded by the National Science Foundation under grants
EEC-9700762, ECS-9809520, EIA-9872516, and EIA-9975275, by an academic
reinvestment grant from Purdue University, and by a grant from the Commission
for Cultural, Educational and Scientific Exchange between the United States of
America and Spain.


References 

1. Vaidy Sunderam, G. A. Geist, J. Dongarra, and R. Manchek. The PVM concurrent
computing system: Evolution, experiences, and trends. Parallel Computing,
20(4):531-545, April 1994.

2. William Gropp, Ewing Lusk, N. Doss, and Anthony Skjellum. A high-performance,
portable implementation of the MPI message passing interface standard. Parallel
Computing, 22(6):789-828, September 1996.

3. Nirav H. Kapadia and Jose A. B. Fortes. PUNCH: An architecture for web-enabled
wide-area network-computing. Cluster Computing: The Journal of Networks, Software
Tools and Applications, 2(2):153-164, September 1999. In special issue on
High Performance Distributed Computing.

4. Nirav H. Kapadia, Jose A. B. Fortes, and Mark S. Lundstrom. The Purdue University
Network-Computing Hubs: Running unmodified simulation tools via the
WWW. ACM Transactions on Modeling and Computer Simulation, 2000. In
forthcoming special issue on Web-based Modeling and Simulation.

5. SSH 2.0 protocol specifications. Internet Engineering Task Force (IETF) drafts
available at http://info.internet.isi.edU/l/in-drafts.

6. Arash Baratloo, Ayal Itzkovitz, and Zvi M. Kedem. Just-in-time resource
management in distributed systems. Technical Report TR1998-762, Department of
Computer Science, New York University, March 1998.

7. Ian Foster and Nicholas T. Karonis. A grid-enabled MPI: Message passing in
heterogeneous distributed computing systems. In Proceedings of the Supercomputing
Conference, 1998.

8. Ian Foster and Carl Kesselman. The Globus project: A status report. In Proceedings
of the 1998 Heterogeneous Computing Workshop (HCW'98), pages 4-18, 1998.

9. Andrew S. Grimshaw and William A. Wulf. Legion: Flexible support for wide-area
computing. In Proceedings of the 7th ACM SIGOPS European Workshop, Ireland,
1996.

10. Roger R. Harper. Interoperability of parallel systems: Running PVM applications
in the Legion environment. Technical Report CS-95-23, Department of Computer
Science, University of Virginia, May 1995.

11. Condor Version 6.1.12 Manual. March 2000.




VECPAR  '2000  -  4th  International  Meeting  on  Vector  and  Parallel  Processing 


Simulating  2-D  Froths;  Fingerprinting  the 
Dynamics 

H.  J.  Ruskin*  and  Y.  Feng*+ 

*  School  of  Computer  Applications,  Dublin  City  University,  Dublin  9,  Ireland. 
+Novell  Ireland  Ltd.,  Dublin  1,  Ireland. 


Abstract 

Topological measures are an obvious choice for investigation
of cellular systems, and average topological properties of a froth, defined
to be shell-structured inflatable (SSI), have been shown to obey simple
relations. However, froth is an intrinsically non-equilibrium system, and
SSI froths typically become non-SSI as coarsening progresses, so that
more general probes may provide further insight. Cluster persistence
permits fingerprinting of froth dynamics at different length scales and
facilitates comparison with non-cellular structures. There is evidence to show
that the average persistent area in a froth achieves a stable value, but
support for power law decay of the average bubble fraction cannot be
established for intermediate time scales. We present simulation results for
both Voronoi and uniform 2-D froths and examine the case for topological
and non-topological probes of the dynamics.


1.  Introduction 

The  soap  froth  is  an  ideal  model  of  a  cellular  network,  which  is  disordered  and 
space-filling [1-7]. It is an intrinsically non-equilibrium system, which evolves
to  a  universal  stable  state,  through  surface-energy  driven  diffusion.  Evolution  or 
coarsening  is  associated  with  two  separate  dynamics,  with  very  different  rates 
of  occurrence,  [8].  The  first  is  due  to  rapid  topological  transformations  with 
corresponding  changes  in  connectivity,  which  occur  system-wide.  The  second 
reflects  slow,  deterministic  relaxation  over  a  long  time-scale,  as  a  consequence 
of  diffusion  of  gas  between  the  bubbles.  The  steady-state  evolution  of  the  froth 
has  been  characterised  by  laws  describing  the  statistics  of  cell  area,  [9],  the 
growth-rate  of  n-sided  cells,  [10]  and  scaling  properties  of  cells  [11]. 

Initially, correlation effects were considered to be restricted to nearest neighbours
only, through the Aboav-Weaire Law [12]. The average number of sides of the
neighbours of an n-sided cell is given as $m(n) = (6 - a) + (6a + \mu_2)/n$, with
$\mu_2$ the second moment of the side distribution $f(n)$ and $a$ the Aboav-Weaire
parameter. More detailed topological correlations have recently been derived, however,
based on analysis of the froth as a system of concentric cells, which can be generated
recursively [13-15]. The distance j between any two cells is the smallest number
of edges crossed by paths connecting one to the other. Any cell may be taken
as the "germ" cell, j = 0, and layers or shells of equidistant cells, j = 0, 1, 2, ...,
are such that the jth layer of cells at distance j encloses layers j-1, j-2, ..., 0 and
includes all cells which are themselves neighbours of at least one cell at distance
j + 1. Any cell which does not obey this condition is said to lie between layers
j - 1 and j and represents a localised defect inclusion with respect to the
froth "skeleton". Any froth without defect inclusions is called shell-structured
inflatable (SSI) [13]. In the asymptotic steady state, topological properties are
invariant, with $\mu_2$ achieving a constant value and with the average cell area
proportional to the square root of the time. Furthermore, $\mu_2$ is a measure of
the disorder in the froth, which affects both the evolution and the fraction of
initial cells remaining [16]. These remaining cells, or survivors, are cells which are
present at a given time $t_j$ and which were also present at $t_i$, $t_i < t_j$ [17]. Most
evolutionary properties are based on the contribution of survivors at different
stages, so that it is more reasonable to choose a known survivor as the germ
cell in a dynamic investigation, although the theory equally applies to any other
choice.
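
As an illustration of the Aboav-Weaire relation quoted above, the following minimal Python sketch evaluates m(n) for a small side distribution and checks it against the Weaire sum rule, <n m(n)> = 36 + mu2, which follows from the relation when the mean number of sides is 6. The distribution `f` and the parameter value `a = 1.2` are hypothetical values chosen only for the example; they are not taken from the simulations reported here.

```python
def second_moment(f):
    """Second moment mu2 = sum_n f(n) * (n - <n>)^2 of a side distribution f."""
    mean = sum(n * p for n, p in f.items())
    return sum(p * (n - mean) ** 2 for n, p in f.items())


def aboav_weaire_m(n, a, mu2):
    """Aboav-Weaire law: m(n) = (6 - a) + (6a + mu2) / n."""
    return (6.0 - a) + (6.0 * a + mu2) / n


# Hypothetical side distribution f(n) with mean number of sides <n> = 6.
f = {5: 0.25, 6: 0.5, 7: 0.25}
mu2 = second_moment(f)
a = 1.2  # assumed Aboav-Weaire parameter, for illustration only

# Weaire sum rule: averaging n * m(n) over f(n) gives 36 + mu2.
sum_rule_lhs = sum(p * n * aboav_weaire_m(n, a, mu2) for n, p in f.items())
print(mu2, sum_rule_lhs)
```

The sum rule holds for any value of `a`, which is why the parameter has to be estimated from the n-dependence of m(n) rather than from the average alone.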

Although topological measures are a natural choice for assessing evolution of a
froth over time, more general measures provide a useful basis for comparison
with systems which do not have cellular structure [18]. The local decay of
persistence towards zero, $P(t_0, t) \sim t^{-\theta}$, was first proposed as a new and
general probe of non-equilibrium dynamics [19] and has recently been discussed in
some detail. To date, however, numerical simulation results for the value of the
exponent $\theta$ are not in good agreement with theory and experiment, although a
limiting value of $\theta = 1$ is indicated by both ([18] and refs. therein).

In a froth, the persistent property of interest may be taken to be the fraction of
the system which has remained within the same bubble from initial time $t_0$ to
given time t. More generally [20], the known cellular structure of the froth may
be exploited for comparison, by definition of a virtual phase, where a given
fraction $\phi$, say, of the bubbles are "coloured" at time $t_0$ and persistent properties of
this fraction are studied as $t \to \infty$. The persistent area is thus bounded above
and below respectively, by areas of coloured survivors and ancestors at $t_0$ [18]
(where ancestors are predecessors of the bubbles remaining at time t). Again,
selection of known survivors for colouring facilitates dynamical investigation of
the froth properties.

In  what  follows,  we  report  for  both  randomly-generated  and  uniform  froths  on 
the  dynamics  of  evolution  as  charted  by  both  topological  and  persistence  probes. 


2. Methodology

Direct simulation, using the method of Weaire and Kermode [21], provides
precise information on independent bubble parameters, with clear distinction
made between topological and diffusive changes. In 2-D, the former include T1
and T2 processes (side-switching and bubble disappearance respectively), and
are effectively instantaneous. Conversely, bubble size (number of sides n, area
A) evolves continuously, but only cells with n > 6 will grow, by von Neumann's
Law, with rate dependent on the initial disorder in the froth.
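
The diffusive part of the evolution can be illustrated with a short Python sketch of von Neumann's Law in its standard form, dA/dt = k (n - 6). This is not the Weaire-Kermode simulation itself: the rate constant `k`, the explicit Euler time step and the assumption that the number of sides n stays fixed (no T1 or T2 events) are simplifications introduced only for the example.

```python
def von_neumann_rate(n, k=1.0):
    """Area growth rate dA/dt = k * (n - 6) for an n-sided 2-D bubble."""
    return k * (n - 6)


def evolve_area(area, n, dt, steps, k=1.0):
    """Explicit Euler integration of von Neumann's Law at fixed topology.

    Returns 0.0 as soon as the area is exhausted, i.e. the point at which
    a T2 process (bubble disappearance) would occur in a full simulation.
    """
    for _ in range(steps):
        area += von_neumann_rate(n, k) * dt
        if area <= 0.0:
            return 0.0
    return area


# Cells with fewer than six sides shrink, six-sided cells are stationary,
# and cells with more than six sides grow.
for sides in (5, 6, 7):
    print(sides, evolve_area(1.0, sides, dt=0.1, steps=5))
```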
A Voronoi froth is intrinsically disordered and non-SSI (Fig. 1).


Fig. 1. Voronoi froth, illustrating topological inclusion in shell structure; intrinsically
non-SSI

A  uniform  hexagonal  froth  however  is  in  mechanical  equilibrium,  so  that  bubble 
movement  must  be  stimulated  initially  by  seeding  the  froth  with  one  or  more 
topological  dislocations  or  defects.  The  simplest  forms  of  defect  are  achieved  by 
forcing  either  a  T1  or  T2  process  to  give  a  pentagon-heptagon  pairing  or  an 
eight-sided  single  cell  respectively.  The  large  cells  are  obvious  survivor  choices 
of  germ  cell  for  shell-structure  analysis.  A  single  defect  has  been  shown  to 
grow  rapidly  until  it  effectively  consumes  the  whole  system,  [22],  and  multiple 
defects  expand  until  impacting  with  each  other,  after  which  changes  are  slower. 
Ancestors  can  be  backtracked  to  the  time  origin,  providing  not  only  a  more 
natural  time  scale  for  determining  the  existence  of  a  fixed  distribution,  f(n), 
but a basis to "colour" sensibly the required volume or sampling fraction $\phi$ in
an  examination  of  persistence  in  the  network. 


Fig.  2.  Voronoi  froth,  with  random  colouring  for  volume  (sampling)  fraction  0.2 

The virtual phase for the Voronoi network (randomly coloured bubbles, Fig. 2),
and for the hexagonal network (centred on defect choice, Fig. 3), have been
followed over time for different "volume" or sampling fraction, $\phi$, ranging from
0.02 to 0.4. This is contrasted (for the Voronoi) with the behaviour observed if
survivors at time $t_i > t_0$ are taken to be the original sample. This latter choice
obviously biases the relative area, since bubbles at $t_i$ will be large compared to
those at $t_0$, but ensures that survivors are featured at the crucial period and
provides confirmation that the equilibrium value has been achieved. Systems of
up to 2500 bubbles were considered.
Fig. 3. Hexagonal froth with defect as central germ cell choice in shell structure
(SSI). Colouring for volume (sampling) fraction $\phi$ in persistence also includes at least
one such defect.


For the topological measures, we have chosen the germ cell at random for the
Voronoi (as in Fig. 1) and as a defect for the uniform network (as above). Key
equations for the topological properties in an SSI froth have been given [13] to
be

$K_{j+1} = s_j K_j - K_{j-1} \quad (j > 1)$    (1)

$Q_j = 6 - K_{j+1} + K_j$    (2)

where $K_j$ is the total number of cells in the layer j and $s_j = m_j - 4$ is a constant
($m_j$ is the average number of sides per cell in the layer j). The logistic map
starts with $K_0 = 1$ and $K_1 = n$, the number of sides (or more generally neighbours)
of the central cell. Equation (2) is a special case of the more general expression
for topological charge, $Q_j$, from the "Gauss" theorem [23], where the general
form applies to any froth, whether SSI or not.
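
The map for the layer populations, K_{j+1} = s_j K_j - K_{j-1}, and the charge relation Q_j = 6 - K_{j+1} + K_j can be checked numerically on the one case where every quantity is known in closed form, the perfect hexagonal lattice. The Python sketch below is only illustrative: the function names are hypothetical, and the seed values K_1 = 6, K_2 = 12 and s_j = 6 - 4 = 2 are those of an undisturbed hexagonal lattice rather than of any simulated froth.

```python
def shell_populations(k_prev, k_curr, s_values):
    """Iterate K_{j+1} = s_j * K_j - K_{j-1} from two known consecutive layers.

    k_prev, k_curr: populations of layers j-1 and j used as seeds.
    s_values: sequence of s_j = m_j - 4 for the layers to be advanced.
    """
    ks = [k_prev, k_curr]
    for s in s_values:
        ks.append(s * ks[-1] - ks[-2])
    return ks


def topological_charge(k_j, k_next):
    """Topological charge Q_j = 6 - K_{j+1} + K_j of layer j."""
    return 6 - k_next + k_j


# Perfect hexagonal lattice: every cell has six sides, so s_j = 2, and the
# first two layers around any cell contain 6 and 12 cells respectively.
ks = shell_populations(6, 12, [2] * 4)
charges = [topological_charge(a, b) for a, b in zip(ks, ks[1:])]
print(ks)       # layer populations grow by 6 per layer
print(charges)  # the charge vanishes in every layer
```

A constant increment of 6 cells per layer and zero charge is the reference behaviour against which the defect-seeded hexagonal froths can be compared.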


An approximate expression for the Aboav-Weaire law for higher shell number
has been proposed [25] as

$m_j K_j \approx 6 K_j + (2 - a)\mu_2$    (3)

which is trivial when the second term on the right-hand side is zero, but suggests
that, in the asymptotic limit (for j), a froth can only be free of defects if $\mu_2 = 0$
or $a = 2$, and we have also explored this for the hexagonal network, for controlled
disorder.

3.  Results 

In the Voronoi froth, the number of survivors in the virtual phase at the early
time period is large and the distribution of area at t to initial area ($A^*(t)/A(0)$)
is left-skewed. However, as the froth coarsens and bubbles disappear, this is
gradually reversed, as few survivors have non-zero persistent area. For $\phi$ very
small, ~0.02, this constitutes a small sampling fraction of the finite system size
(50 in 2500 bubbles), so that over a large number of time steps, the quantity
$\langle A^*(t)/A(0) \rangle$ achieves a relatively stable value of just under 0.4 for the biased
sample, and this appears to agree reasonably well with the value indicated for
the initial random sample, although equilibrium is less clearly established in
this case. As $\phi$ is increased (0.1, 0.2 say), decay is slower and it is not evident
that a time-independent value is finally achieved, although the curves do flatten
around t = 700-750 time steps for the biased sample in both cases. This occurs
at a value less than 0.5, and indications are that a similar value is attained for
the random sample in both cases. Consequently, the qualitative evidence is
reasonably supportive of a time-independent form for $\langle A^*(t)/A(0) \rangle$ with a
value between 0.35 and 0.55 (Fig. 4).




Fig. 4. Average area of persistent regions within a bubble at time t, normalised by
area of bubble at time $t_0$ (persistent area ratio), as a function of time.
A ($\phi$ = 0.02, biased), B ($\phi$ = 0.1, biased), C ($\phi$ = 0.2, biased), D ($\phi$ = 0.02, random),
E ($\phi$ = 0.1, random), F ($\phi$ = 0.2, random)

Further, $N^*(t)/N(t)$, the fraction of bubbles containing persistent area at time
t, is clearly expected to decrease with t and, plotted against average bubble area
$\langle A(t) \rangle$ for t large, we might hope to observe decay. Unfortunately, it seems
clear that the percentage of initial bubbles which contribute to any decay is
extremely small (< 5%) and the simulation time-scale is too short to be able to
view this for the Voronoi, with area growth inevitably limited by the finite size
of the system. For $\phi$ very small, there is considerable noise, which decreases as
$\phi$ increases, but again no decay can be observed for $N^*(t)/N(t)$.

We have also considered, therefore, the persistence of the virtual phase in a
uniform hexagonal network, since the growth in area of the large cell or cells is
extremely rapid, so that some contraction of the time-scale may be achieved.
The evolution for a single defect is a special case, for which usual asymptotic
relations do not apply [22], but as the defect concentration is increased, the
evolution of the froth is closer to that of the Voronoi. Again, results for low $\phi$
are reasonably supportive of a time-independent value for the area ratio, but
do depend on whether one or more defects or large cells are included as part of
the virtual phase. No decay of $N^*(t)/N(t)$ can be observed for the size of systems
considered so far (900, 1600 bubbles), although larger hexagonal systems with
low defect concentrations and small $\phi$ merit further study.
With respect to the topological probes, the topological charge is initially
constant from the second layer for single defects of both T1- and T2-induced type
in the hexagonal froth, with the number of cells in a layer increasing linearly
after the first few layers as $K_{j+1} = K_j + 6$ ($j > p$, where p depends on centre
cell choice). However, we showed recently that, for the T1-formed froth,
inclusions occur very quickly in the first few layers, so that the structure becomes
non-SSI for the remainder of the evolution. The T2-formed froth, however,
remains dynamically SSI, with $\mu_2 \neq 0$ only for the zeroth, first and second layers,
so that the suggestion that $\mu_2 = 0$ is necessary for a defect-free froth is clearly
incorrect. (The second moment $\mu_2$ does not attain a constant value, so Equation
(3) clearly does not hold, and further, $\mu_2$ is also slow to stabilise for low
concentration of defects.) A more formal expression, relating two-cell correlators,
$a_1(k,n)$, for nearest neighbours in froth to $n\,m(n)$ has been given [15], and
generalised for j, and we note that the total number of first neighbours is always
known for seeded disorder in the hexagonal structures, so that two-cell
correlations may be obtained for the dynamic T2-formed froth, but not in general. For
the case of low defect concentration, for example, the percentage of topological
inclusions between shell layers is small prior to impact between defects.
Nevertheless, inclusions will occur at some stage, so that topological correlations
as a function of the layer distance j no longer apply. Although the (single-defect)
T2-formed froth is the only exception we have found to the general rule
for dynamic froths, a non-SSI froth with a small percentage of inclusions has
statistical distributions similar to those for SSI froth and topological properties
may still be exploited to some extent. For large amounts of disorder or
randomness, more general probes of the non-equilibrium dynamics seem indicated,
although choice is limited by the need to reflect the froth's cellular structure.

4.  Conclusions 

Topological probes arise naturally in soap froth dynamics and shell-structure
analysis provides measures which relate predominantly to SSI froths. The
(single-defect) T2-formed hexagonal froth is the only example we have found
of a dynamic SSI froth which does not require $\mu_2 = 0$ and for which
topological relations for charge and two-cell correlation apply directly. For a small
percentage of inclusions between cell layers, however, non-SSI and SSI froth
have similar statistical distributions. Cluster persistence, on the other hand,
provides a general probe of non-equilibrium dynamics, but time-scales required
to observe persistence decay are very long. Nevertheless, numerical simulation
results indicate that time-independent values are achieved for some persistent
properties in Voronoi and uniform froths, for a range of sampling fractions $\phi$,
where persistent area is given roughly to be $\langle A^*(t)/A(0) \rangle \approx 0.45$ for the
former. This indicates the need for further investigation of persistence properties
for networks, where the amount of seeded disorder can be controlled.
Abstract. Branch-and-prune (BP) and branch-and-bound (BB) techniques
are commonly used for intelligent search in finding all solutions,
or the optimal solution, within a space of interest. The corresponding
binary tree structure provides a natural parallelism allowing concurrent
evaluation of subproblems using parallel computing technology. Of special
interest here are techniques derived from interval analysis, in particular
an interval-Newton/generalized-bisection procedure. In this context,
we discuss issues of load balancing and work scheduling that arise in the
implementation of parallel BB and BP, and describe and analyze techniques
for this purpose. These techniques are applied to solve problems
appearing in chemical process engineering using a distributed parallel
computing system. Results show that a consistently high efficiency can
be achieved in solving nonlinear equations, providing excellent scalability.
The effectiveness of the approach used is also demonstrated in the
consistent superlinear speedup observed in performing global optimization.


1  Introduction 

The continuing success of the chemical and petroleum processing industries
depends on the ability to design and operate complex, highly interconnected plants
that are profitable and that meet quality, safety, environmental and other
standards. Towards this goal, process modeling, simulation and optimization tools
are increasingly being used industrially in every step of the design process and in
subsequent plant operations. Performing realistic and reliable process simulation
and optimization for industrial-scale processes, however, requires very large-scale
computational resources. Parallel computing technology offers the potential to
provide the necessary computational power. However, since most currently used
problem solving techniques in process modeling and optimization were developed
for use on conventional serial machines, it is often necessary to rethink
problem solving strategies in order to take full advantage of parallel computing
technology.

* Author to whom all correspondence should be addressed. Fax: (219) 631-8366;
E-mail: markst@nd.edu
In this context, we are particularly interested in the use of parallel computing
technology to address reliability issues that arise in solving process engineering
problems. The models that must be solved in process simulation problems are
typically highly nonlinear and may have multiple solutions. The goal is to find
all solutions, to ensure that the solution or solutions of interest are not missed.
Similarly, in optimization problems, the nonlinear programming problems to
be solved are typically nonconvex, and there may be several local optima. The
goal is to find the global optimum, though in some problems finding all of the
local optima may be of interest as well. The approach we apply involves the
use of interval analysis, combined with branch-and-prune (BP) or
branch-and-bound (BB) strategies. Properly implemented, such techniques can find, or more
precisely enclose, all solutions to a system of nonlinear equations, and can be
used to enclose the global optimum, or all local optima, in optimization problems.
This can be done with mathematical and computational certainty.

Since the subproblems (tree nodes) generated in the tessellation step in BB
and BP algorithms are independent, these techniques are particularly amenable
to parallel processing. In this paper, we focus specifically on issues of load
balancing and scheduling that arise in the implementation of parallel BB and BP, and
describe and analyze techniques for this purpose. An application to a problem
arising in chemical process engineering is used to demonstrate the effectiveness
of the approach used.


2  Distributed  Parallel  Computing 

The solution of realistic, industrial-scale simulation and optimization problems
is computationally very intense, and requires the use of adequate computational
resources to be done in a timely manner. High performance computing (HPC)
technology, in particular parallel computing, provides the computational power
to realistically model, simulate, design and optimize complex chemical
manufacturing processes. Making better use of these leading-edge technologies in process
simulation requires techniques that efficiently exploit parallel computational
resources. One of the major trends in this regard is the use of distributed
computing systems. Typically, in this sort of system, memory is physically
distributed, and communication may be done by message passing through some
interconnection network.

The use of parallel processing in chemical engineering has attracted
significant attention over the past decade or so. There are a variety of applications
for which a distributed approach to parallel computing has proven to be
effective. In chemical process systems engineering, some examples that involve either
actual implementation on distributed systems, or algorithms appropriate for
distributed computing, can be seen in the field of deterministic global optimization
and reliable nonlinear equation solving (e.g., [1-9]), nondeterministic global
optimization (e.g., [10-12]), BB in process scheduling (e.g., [13-16]), BB in process
synthesis (e.g., [10,17-19]), and process simulation, analysis and optimization
(e.g., [20-39]). There are also a number of important application areas outside
of process systems engineering (e.g., [40-45]).

The type of distributed parallel system of particular interest here is a cluster
of workstations (COW), in which multiple workstations on a network are used
as a single parallel computing resource. This sort of parallel computing system
has the advantages that it is relatively inexpensive and is based on widely
available hardware. Thus, such an approach to parallel computing has become an
important trend in providing high performance computing resources in science
and engineering.

3  Branch-and-Bound 

Branch-and-prune (BP) and branch-and-bound (BB) algorithms are
general-purpose intelligent search techniques for finding all solutions, or the optimal
solution, within a space of interest, and have a wide range of applications. These
techniques employ successive decomposition (tessellation) of the global problem
into smaller disjoint or independent subproblems that are solved recursively until
all solutions, or the optimal solution, are found. BB and BP have important
applications in engineering and science, especially when a global solution to an
optimization problem, or all solutions to a nonlinear equation solving problem,
are sought. In chemical engineering, these applications include process synthesis
(e.g., [10,17-19]), process scheduling (e.g., [13-16]), analysis of phase behavior
(e.g., [46-48]), and molecular modeling (e.g., [49]).

In BP, a subproblem is typically processed in some way to verify the existence
of a feasible solution. The subproblem may be examined by a series of tests, and
is pruned when it fails specified criteria or if a unique solution can be found inside
this subdomain. If no conclusion is available, and so the subproblem cannot be
pruned, the problem is bisected into two additional subproblems (nodes),
generating a binary tree structure. One of the subproblems is then put in a
stack and tests are continued on the other. This type of BP procedure is one
of the basic ideas underlying the application of interval analysis to
equation-solving problems. More details on interval analysis, in particular the
interval-Newton/generalized-bisection (IN/GB) method, are presented in the next section.
When solving a system of nonlinear equations, the pruning scheme consists of
a function range test and the interval-Newton existence and uniqueness test.
There are three situations in which an interval (node) can be pruned: (1) zero
is not contained in any component of the function range; (2) a unique solution
is proven to be enclosed; and (3) it is proven that no solutions exist. With these
pruning criteria, a scheme can be constructed that searches the entire binary
tree and finds all solutions of the equation system.
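
The three pruning situations can be made concrete with a one-dimensional Python sketch. It is an illustration under stated assumptions, not the IN/GB implementation discussed in this paper: the test equation f(x) = x^2 - 2, the hand-written interval extensions `f_range` and `df_range`, and the width tolerance `tol` are all chosen for the example, and ordinary floating-point arithmetic is used without the outward rounding that a rigorous interval code requires.

```python
def f(x):
    return x * x - 2.0


def f_range(lo, hi):
    """Interval extension of f over [lo, hi] (no outward rounding)."""
    squares = (lo * lo, hi * hi)
    sq_lo = 0.0 if lo <= 0.0 <= hi else min(squares)
    return sq_lo - 2.0, max(squares) - 2.0


def df_range(lo, hi):
    """Interval extension of f'(x) = 2x over [lo, hi]."""
    return 2.0 * lo, 2.0 * hi


def branch_and_prune(lo, hi, tol=1e-12):
    """Return enclosures of all roots of f in [lo, hi]."""
    enclosures = []
    stack = [(lo, hi)]
    while stack:
        lo, hi = stack.pop()
        f_lo, f_hi = f_range(lo, hi)
        if f_lo > 0.0 or f_hi < 0.0:
            continue  # (1) zero is not in the function range
        d_lo, d_hi = df_range(lo, hi)
        if d_lo > 0.0 or d_hi < 0.0:
            # Interval-Newton image N = mid - f(mid) / F'(X); valid because
            # the derivative interval does not contain zero.
            mid = 0.5 * (lo + hi)
            q = (f(mid) / d_lo, f(mid) / d_hi)
            n_lo, n_hi = mid - max(q), mid - min(q)
            if n_hi < lo or n_lo > hi:
                continue  # (3) N does not intersect X: no solution exists
            if lo <= n_lo and n_hi <= hi:
                enclosures.append((n_lo, n_hi))  # (2) unique solution enclosed
                continue
            lo, hi = max(lo, n_lo), min(hi, n_hi)  # contract to N intersect X
        if hi - lo < tol:
            enclosures.append((lo, hi))  # too narrow to bisect: keep as candidate
            continue
        mid = 0.5 * (lo + hi)
        stack.append((lo, mid))  # bisect; both halves go on the stack here
        stack.append((mid, hi))
    return sorted(enclosures)


roots = branch_and_prune(-3.0, 3.0)
print(roots)
```

For simplicity both halves of a bisected interval are pushed onto the stack, whereas the procedure described above keeps one half and continues testing the other; the set of nodes visited is the same.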

In BB, the goal is typically to find a globally optimal solution to some
problem. BB may be built on top of BP schemes by embedding an additional pruning
test. In this test, a node is pruned when its optimal (lower bounding) solution is
guaranteed to be worse (greater) than some known current best value (an upper
bound on the global minimum). Thus, one avoids visiting subproblems which
are known not to contain the globally optimal solution. In this context, various
heuristic schemes may be of considerable importance in maintaining search
efficiency. For example, when solving global minimization problems using interval
analysis, the best upper bound value may be generated and updated by some
heuristic combination of an interval extension of the objective function, a point
objective function evaluation with interval arithmetic, and a local minimization
with a verification by interval analysis. In order to enhance bounding and
pruning efficiency, some approaches also apply a priority list scheme in BB. Typically,
all problems in the stack are rearranged in the order of some importance index,
such as a lower bound value. The idea is that the most important subproblems
stored in the stack are examined with higher priority, in the hope that the global
optimum will be found early in the search process, thus allowing other later
subproblems that do not possess the global optimum to be quickly pruned before they
generate new nodes.
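
A priority list ordered by lower bound can be sketched in a few lines of Python using a binary heap. The example is a simplified illustration, not the scheme used in this paper: the objective f(x) = x^4 - 4x^2 + x, the natural interval extension in `lower_bound`, the midpoint evaluation used to update the upper bound, and the tolerance `tol` are assumptions made for the example, and rounding errors are ignored.

```python
import heapq


def f(x):
    return x**4 - 4.0 * x**2 + x


def even_power_range(lo, hi, p):
    """Range of x**p over [lo, hi] for an even power p."""
    vals = (lo**p, hi**p)
    low = 0.0 if lo <= 0.0 <= hi else min(vals)
    return low, max(vals)


def lower_bound(lo, hi):
    """Lower bound of f over [lo, hi] from its natural interval extension."""
    x4_lo, _ = even_power_range(lo, hi, 4)
    _, x2_hi = even_power_range(lo, hi, 2)
    return x4_lo - 4.0 * x2_hi + lo


def branch_and_bound(lo, hi, tol=1e-4):
    """Best-first BB: the node with the smallest lower bound is examined first."""
    best_x = 0.5 * (lo + hi)
    best_f = f(best_x)  # current best value: upper bound on the global minimum
    heap = [(lower_bound(lo, hi), lo, hi)]
    while heap:
        lb, lo, hi = heapq.heappop(heap)
        if lb > best_f:
            continue  # bound test: node cannot contain the global minimum
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid < best_f:
            best_x, best_f = mid, f_mid  # update the upper bound
        if hi - lo < tol:
            continue  # node is narrow enough; stop subdividing it
        for a, b in ((lo, mid), (mid, hi)):
            child_lb = lower_bound(a, b)
            if child_lb <= best_f:
                heapq.heappush(heap, (child_lb, a, b))
    return best_x, best_f


x_best, f_best = branch_and_bound(-3.0, 3.0)
print(x_best, f_best)
```

The objective has two local minima; examining nodes in order of lower bound lets the upper bound drop quickly, so the region around the shallower minimum is discarded after only a few bisections.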

In BB or BP search, the shape and size of the search space typically changes
as the search proceeds. Portions that contain a solution might be highly
expanded with many nodes and branches, while portions that have no solutions
might be discarded immediately, thus resulting in an irregularly structured
search tree. It is only through actual program execution that it becomes
apparent how much work is associated with individual subproblems and thus what
the actual structure of the search tree is. Since the subproblems to be solved
are independent, execution of both BP and BB on parallel computing systems
can clearly provide improvements in computational efficiency; thus the use of
parallel computing to implement BP and BB has attracted significant attention
(e.g., [50-56]). However, because of the irregular structure of the binary tree,
this implementation on distributed systems is often not straightforward. Details
concerning the methodology for implementing BP and BB on distributed parallel
systems will be discussed in later sections.


4  Interval  Analysis 


Of particular interest here are BP and BB schemes based on interval analysis.
A real interval $Z$ is defined as the set of real numbers lying between (and
including) given upper and lower bounds; i.e.,
$Z = [z^L, z^U] = \{z \in \Re \mid z^L \le z \le z^U\}$.
A real interval vector $Z = (Z_1, Z_2, \ldots, Z_n)^T$ has $n$ real interval
components and can be interpreted geometrically as an $n$-dimensional rectangle
(box). Note that in this section lower case quantities are real numbers and upper
case quantities are intervals. Several good introductions to interval analysis are
available (e.g., [57-59]). In this section, interval analysis is described in the
context of solving nonlinear parameter estimation problems, since that is the
primary example used in the tests discussed later. However, it should be emphasized
that the interval methods discussed here are general-purpose and can be used in
connection with other objective functions in a global optimization problem and
other equation systems in an equation solving problem.

BP and BB techniques can be constructed using the interval-Newton technique.
Given a nonlinear equation system with a finite number of real roots in
some initial interval, this technique provides the capability to find (or, more
precisely, narrowly enclose) all the roots of the system within the given initial
interval. For the unconstrained minimization of an objective function (or
estimator) $\phi(\theta)$ in parameter estimation, a common approach is to use the
gradient of $\phi(\theta)$ and seek a solution of
$g(\theta) = \nabla\phi(\theta) = 0$ in order to determine the optimal parameter
values $\theta$. The global minimum will be a root of this nonlinear
equation system, but there may be many other roots as well, representing local
minima and maxima and saddle points. Thus, for this approach to be reliable,
the capability to find all the roots of $g(\theta) = 0$ is needed, and this is
provided by the interval-Newton technique. In practice, by using an objective
range test, as discussed below, the interval-Newton procedure can also be
implemented as a BB technique, so that roots of $g(\theta) = 0$ that cannot be the
global minimum need not be found. The solution algorithm is applied to a sequence
of intervals, beginning with some initial interval $\Theta^{(0)}$ specified by the
user. This initial interval can be chosen to be sufficiently large to enclose all
physically feasible values. It is assumed here that the global optimum will occur
at an interior stationary minimum of $\phi(\theta)$ and not at the boundaries of
$\Theta^{(0)}$. Since the estimator $\phi(\theta)$ is derived based on a product of
Gaussian distribution functions corresponding to each data point, only a stationary
global minimum is reasonable for statistical regression problems such as those
considered here.

For an interval $\Theta^{(k)}$ in the sequence, the first step in the solution
algorithm is the function range test. Here an interval extension
$G(\Theta^{(k)})$ of the function $g(\theta)$ is calculated. An interval extension
provides upper and lower bounds on the range of values that a function may have in
a given interval. It is often computed by substituting the given interval into the
function and then evaluating the function using interval arithmetic. The interval
extension so determined is often wider than the actual range of function values,
but it always includes the actual range. If there is any component of the interval
extension $G(\Theta^{(k)})$ that does not contain zero, then we may discard (prune)
the current interval (node) $\Theta^{(k)}$, since the range of the function does
not include zero anywhere in this interval, and thus no solution of
$g(\theta) = 0$ exists in this interval. We may then proceed to consider the next
interval in the sequence, since the current interval cannot contain a stationary
point of $\phi(\theta)$. Otherwise, if $0 \in G(\Theta^{(k)})$, then testing of
$\Theta^{(k)}$ continues.
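As a small illustration of the function range test, the following sketch evaluates a natural interval extension for a scalar example. The `Interval` class, the example function g(x) = x*x - x and the helper names are assumptions made for illustration only, and ordinary floating point arithmetic is used without the outward rounding that a rigorous implementation requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __sub__(self, other):
        # Smallest difference pairs our lower end with the other's upper end.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The product range is spanned by the four endpoint products.
        p = (self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi)
        return Interval(min(p), max(p))

    def contains(self, value):
        return self.lo <= value <= self.hi


def G(X):
    """Natural interval extension of the hypothetical g(x) = x*x - x."""
    return X * X - X


def function_range_test(X):
    """True if the box must be kept (0 is in G(X)), False if it can be pruned."""
    return G(X).contains(0.0)


print(G(Interval(2.0, 3.0)))                     # wider than the true range [2, 6]
print(function_range_test(Interval(2.0, 3.0)))   # box can be pruned
print(function_range_test(Interval(0.5, 1.5)))   # box contains the root x = 1
print(function_range_test(Interval(0.2, 0.8)))   # kept although it has no root
```

The last case shows why the test is only a pruning device: because the extension overestimates the range, a box that passes may still contain no root, and further tests are needed.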

The next step is the objective range test. The interval extension
$\Phi(\Theta^{(k)})$, which contains the range of $\phi(\theta)$ over
$\Theta^{(k)}$, is computed. If the lower bound of $\Phi(\Theta^{(k)})$ is greater
than a known upper bound on the global minimum of $\phi(\theta)$, then
$\Theta^{(k)}$ cannot contain the global minimum and need not be further tested.
Otherwise, testing of $\Theta^{(k)}$ continues. The upper bound on the objective
function used for comparison in this step can be determined and updated in a number
of different ways. Here we use point evaluations of $\phi(\theta)$ done at the
midpoint of previously tested $\Theta$ intervals that may contain stationary
points. Using the objective range test yields a BB procedure for the global
minimization of $\phi(\theta)$, while if this step is skipped, we will have a BP
technique for finding all solutions of $g(\theta) = 0$, i.e., all stationary points
of $\phi(\theta)$.

The next step is the interval-Newton test. Here the linear interval equation
system

$$G'(\Theta^{(k)})\,(N^{(k)} - \theta^{(k)}) = -g(\theta^{(k)})$$

is set up and solved for a new interval $N^{(k)}$, where $G'(\Theta^{(k)})$ is an
interval extension of the Jacobian of $g(\theta)$, i.e., the Hessian of
$\phi(\theta)$, over the current interval $\Theta^{(k)}$, and $\theta^{(k)}$ is a
point in the interior of $\Theta^{(k)}$, usually taken to be the midpoint. It has
been shown (e.g., [57-59]) that any root $\theta^* \in \Theta^{(k)}$ of
$g(\theta) = 0$ is also contained in the image $N^{(k)}$, implying that if there is
no intersection between $\Theta^{(k)}$ and $N^{(k)}$ then no root exists in
$\Theta^{(k)}$, and suggesting the iteration scheme
$\Theta^{(k+1)} = \Theta^{(k)} \cap N^{(k)}$. In addition to this iteration step,
which can be used to tightly enclose a solution, it has been proven (e.g., [57-59])
that if $N^{(k)}$ is contained completely within $\Theta^{(k)}$, then there is one
and only one root contained within the current interval $\Theta^{(k)}$. This
property is quite powerful, as it provides a mathematical guarantee of the
existence and uniqueness of a root within an interval when it is satisfied.

There are thus three possible outcomes to the interval-Newton test, as shown
schematically for a two variable problem in Figs. 1-3. The first possible outcome
(Fig. 1) is that $N^{(k)} \subset \Theta^{(k)}$. This represents mathematical proof
that there exists a unique solution to $g(\theta) = 0$ within the current interval
$\Theta^{(k)}$, and that this solution also lies within the image $N^{(k)}$. This
solution can be rigorously enclosed, with quadratic convergence, by applying the
interval-Newton step to the image and repeating a small number of times.
Alternatively, convergence to a point approximation of the solution can be
guaranteed using a routine point-Newton method starting from anywhere inside of the
current interval. Since a unique solution has been identified for this subproblem,
it can be pruned, and the next interval in the sequence can now be tested,
beginning with the function range test.

The second possible outcome (Fig. 2) is that
$N^{(k)} \cap \Theta^{(k)} = \emptyset$. This provides mathematical proof that no
solutions of $g(\theta) = 0$ exist within the current interval. Thus, the current
interval can be pruned and testing of the next interval can begin.

The final possible outcome (Fig. 3) is that the image $N^{(k)}$ lies partially
within the current interval $\Theta^{(k)}$. In this case, no conclusions can be
made about the number of solutions in the current interval. However, it is known
that any solutions that do exist must lie in the intersection
$\Theta^{(k)} \cap N^{(k)}$. If the intersection is sufficiently smaller than the
current interval, one can proceed by reapplying the interval-Newton test to the
intersection. Otherwise, the intersection is bisected, and the resulting two
intervals are added to the sequence of intervals to be tested. This approach is
referred to as an interval-Newton/generalized-bisection (IN/GB) method and,
depending on whether or not the objective range test is employed, can be
interpreted as either a BB or BP procedure.
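The sequence of tests can be sketched for a single variable as follows. This is an illustrative simplification, not the implementation used in this work: the function g(x) = x^3 - x, the names `in_gb`, `g_range`, `dg_range` and the constant `PAD` are all assumptions. Boxes whose derivative range contains zero are simply bisected (no extended interval division), and a small fixed padding stands in for true outward rounding, so the result is not a rigorous enclosure.

```python
def g(x):
    """Hypothetical scalar example with roots at -1, 0 and 1."""
    return x * x * x - x


def g_range(a, b):
    # x^3 is increasing, so [a^3 - b, b^3 - a] encloses g on [a, b].
    return a * a * a - b, b * b * b - a


def dg_range(a, b):
    # Interval extension of g'(x) = 3x^2 - 1 on [a, b].
    sq_lo = 0.0 if a <= 0.0 <= b else min(a * a, b * b)
    sq_hi = max(a * a, b * b)
    return 3.0 * sq_lo - 1.0, 3.0 * sq_hi - 1.0


PAD = 1e-12  # crude safety margin, NOT a substitute for outward rounding


def in_gb(a, b, tol=1e-8):
    """Return small boxes that may contain roots of g in [a, b]."""
    boxes, stack = [], [(a, b)]
    while stack:
        a, b = stack.pop()
        g_lo, g_hi = g_range(a, b)
        if g_lo > 0.0 or g_hi < 0.0:
            continue                      # function range test: prune
        if b - a < tol:
            boxes.append((a, b))          # accept a sufficiently small box
            continue
        m = 0.5 * (a + b)
        d_lo, d_hi = dg_range(a, b)
        if d_lo <= 0.0 <= d_hi:
            stack.extend([(a, m), (m, b)])  # cannot divide: bisect instead
            continue
        # Image N = m - g(m) / [d_lo, d_hi].
        q = (g(m) / d_lo, g(m) / d_hi)
        n_lo, n_hi = m - max(q) - PAD, m - min(q) + PAD
        lo, hi = max(a, n_lo), min(b, n_hi)
        if lo > hi:
            continue                      # empty intersection: no root here
        if hi - lo <= 0.5 * (b - a):
            stack.append((lo, hi))        # good contraction: reapply the test
        else:
            c = 0.5 * (lo + hi)
            stack.extend([(lo, c), (c, hi)])  # poor contraction: bisect
    return sorted(boxes)


print(in_gb(-2.0, 2.5))
```

In this simplified form a root lying exactly on a bisection boundary may be reported in two adjacent boxes, and an accepted box is only a candidate; the existence and uniqueness guarantee requires the containment check of the first outcome together with properly rounded interval arithmetic.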

Fig. 1. The computed image $N^{(k)}$ is a subset of the current interval
$\Theta^{(k)}$. This is mathematical proof that there is a unique solution of the
equation system in the current interval, and furthermore that this unique solution
is also in the image.

Fig. 2. The computed image $N^{(k)}$ has a null intersection with the current
interval $\Theta^{(k)}$. This is mathematical proof that there is no solution of
the equation system in the current interval.

Fig. 3. The computed image $N^{(k)}$ has a nonnull intersection with the current
interval $\Theta^{(k)}$. Any solutions of the equation system must lie in the
intersection of the image and the current interval.

It should be emphasized that, when machine computations with interval
arithmetic operations are done, the endpoints of an interval are computed with
a directed outward rounding. That is, the lower endpoint is rounded down to
the next machine-representable number and the upper endpoint is rounded up to
the next machine-representable number. In this way, through the use of interval,
as opposed to floating point, arithmetic, any potential rounding error problems
are eliminated, yielding an approach that can provide a computational, not just
mathematical, guarantee of reliability. Overall, when properly implemented, the
IN/GB method described above provides a procedure that is mathematically and
computationally guaranteed to find the global minimum of $\phi(\theta)$, or, if
desired, to enclose all of its stationary points (within, of course, the specified
initial parameter interval $\Theta^{(0)}$).
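The effect of outward rounding can be sketched for interval addition. The helper `add_outward` is hypothetical, and the sketch assumes Python 3.9 or later for `math.nextafter`. Because the hardware sum is rounded to nearest, stepping each endpoint one representable number outward gives a valid, though slightly pessimistic, enclosure; true directed rounding would be tighter.

```python
import math
from fractions import Fraction


def add_outward(x, y):
    """Add intervals x = (lo, hi) and y = (lo, hi) with outward-moved endpoints."""
    lo = math.nextafter(x[0] + y[0], -math.inf)  # step the lower endpoint down
    hi = math.nextafter(x[1] + y[1], math.inf)   # step the upper endpoint up
    return lo, hi


lo, hi = add_outward((0.1, 0.1), (0.2, 0.2))
# Exact rational sum of the two stored floating point values.
exact = Fraction(0.1) + Fraction(0.2)
print(lo, hi, Fraction(lo) <= exact <= Fraction(hi))
```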


5  Dynamic  Load  Balancing  and  Work  Scheduling 

As noted above, since the subproblems to be solved are independent, the
execution of interval-Newton techniques, whether BP or BB, on distributed parallel
systems can clearly provide improvements in computational efficiency. And since,
for practical problems, the binary tree that needs to be searched may be quite
large, there may in fact be a strong motivation for trying to exploit the
opportunity for parallel computing. However, because of the irregular structure of
the binary tree, doing this may not be straightforward.

When the unprocessed workload (stack boxes) is assigned to the available
processors during program execution, the irregularity of the tree could cause a
highly uneven distribution of work among processors and result in poor utilization
of computing resources. Newly generated boxes at some tree nodes, due to bisection,
could cause some processors to become highly loaded while others, if processing
tree nodes that can be pruned, could become idle or lightly loaded. In this
context, we need an effective dynamic load balancing and work scheduling scheme to
perform the parallel tree search efficiently. To manage the load balancing problem,
one seeks to apply an optimal work scheduling strategy to transfer workload (boxes
to be tested) automatically from heavily loaded processors to lightly loaded
processors or processors approaching an idle state. The primary goal of dynamic
load balancing algorithms is to schedule workload among processors during program
execution, to prevent the appearance of idle processors, while minimizing
interprocessor communication cost and thus maximizing the utilization of the
computing resources.

A common load balancing strategy is the "manager-worker" scheme (e.g.,
[3, 4, 7, 12, 19]), in which a single "manager" processor centrally conducts a
group of "worker" processors to perform a task concurrently. This scheme has been
popular in part because it is relatively easy to implement. It amounts to using a
centralized pool to buffer workloads among processors. However, as the number
of processors becomes large, such a centralized scheme could result in a
significant communication overhead expense, as well as contention on the manager
processor. As a result, in many cases, this scheme does not exhibit particularly
good scalability. Thus, to avoid bottlenecks and high communication overhead,
we concentrate here on decentralized schemes (without a global stack manager),
and consider three types of load balancing algorithms specifically designed for
network-based parallel computing using message passing.

These parallel algorithms adopt a distributed strategy that allows each
processor to locally make workload placement decisions. This strategy helps a
processor maintain for itself a moderate local workload stack, hopefully
preventing itself from becoming idle, and alleviates bottleneck effects when
applied on large-scale multicomputers. All distributed parallel algorithms of this
type are basically composed of five phases: workload measurement, state information
exchange, transfer initiation, workload placement, and global termination. Each of
these phases is now discussed in more detail.


5.1  Workload  Measurement 

As the first stage in a dynamic load balancing operation, workload measurement
involves evaluation of the current local workload using some "work index". This
is a criterion that needs to be calculated frequently, and so it must be inexpensive
to determine. It also needs to be sufficiently precise for purposes of making good
workload placement decisions later. In the context of interval BP and BB, a good
approach is to simply use the stack length (number of boxes) as the work index.
This index is effective in parallel BP and BB schemes because of the following
characteristics:

- A long stack represents a heavy workload and vice versa.

- An empty stack indicates that the local processor is approaching an idle
state.

- A precise representation of workload by the work index may not be needed, since
it may not be necessary to maintain an equal workload on all processors, but
merely to prevent the appearance of idle states.

Thus, the stack length can serve as a simple, yet effective, workload index.

5.2  State  Information  Exchange 

After all processors identify their own workload state, the parallel algorithm
makes this local information available to all other cooperating processors, through
interprocessor message passing, to construct a global work index vector. The
cooperating processors are a group of processors participating in load balancing
operations with a local processor, and define the domain of interprocessor
communication, thereby determining a virtual network for cooperation. The range of
this domain is critical in determining the cost of communication and the
performance of load balancing. One possibility is that the cooperating processors
could include all processors available on the network, and a global all-to-all
communication scheme could then be used to update global state information. This
provides a very up-to-date global work index vector but might come at the expense
of high communication overhead. Alternatively, the cooperating processors might
include only a small subset of the available processors, with this small subset
defining a local processor's nearest "neighbors" in the virtual network. Now
one needs only to employ cheap local point-to-point communication operations.
However, without a well-tailored and nested virtual network, and a good load
balancing algorithm, these local schemes could result in workload imbalance and
idle states.


5.3  Transfer  Initiation 

After obtaining an overview of the workload state, at least for the group of
cooperating ("neighboring") processors, load balancing algorithms now need to
decide if a workload placement is necessary to maintain balance and prevent an
idle state. This is done according to an initiation policy which dictates under
what conditions a workload (box) transfer is initiated, and decides which
processors will trigger the load balancing operation. Generally, the migration of
boxes from one processor to another processor is initiated on demand. In this
context, the load balancing operations are event driven according to different
procedures, such as a sender-initiate scheme (e.g., [60-62]), a receiver-initiate
scheme (e.g., [63-65]) and a symmetric scheme (e.g., [2, 66, 67]). In the
sender-initiate scheme, when the workload of any processor is too heavy and exceeds
an upper threshold, the overloaded processor will offload some of its stack boxes
to another processor through the network. The receiver-initiate approach works in
the opposite way by having an underloaded processor request boxes from heavily
loaded processors, when the underloaded processor's workload is less than a lower
threshold. The symmetric scheme combines the previous two strategies and allows
both underloaded and overloaded processors to initiate load balancing operations.
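The three initiation policies can be summarized in a short sketch. The function `initiate`, its string labels and the threshold values are hypothetical and serve only to make the decision logic explicit; the work index is assumed to be the local stack length.

```python
def initiate(load, lower, upper, scheme):
    """Decide which transfer, if any, a processor initiates.

    load   -- local work index (assumed here to be the stack length)
    lower  -- lower threshold below which a processor is underloaded
    upper  -- upper threshold above which a processor is overloaded
    scheme -- "sender", "receiver" or "symmetric"
    """
    if scheme in ("sender", "symmetric") and load > upper:
        return "send"       # overloaded processor offloads boxes
    if scheme in ("receiver", "symmetric") and load < lower:
        return "request"    # underloaded processor asks for boxes
    return None             # no load balancing operation is triggered


for scheme in ("sender", "receiver", "symmetric"):
    print(scheme, initiate(12, 2, 10, scheme), initiate(1, 2, 10, scheme))
```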

5.4  Workload  Placement

The next step of the load balancing algorithm is to complete a workload placement.
Here the donor processor splits the local stack into two parts, sending one part
to the requesting processor and retaining the other. This operation is done
according to a transfer policy consisting of two rules: a work-adjusting rule and
a work-selection rule. The work-adjusting rule determines how to distribute
workload among processors and how many stack boxes are to be transferred.
If the requesting processor receives too little work, it may quickly become idle;
if the donor processor offloads too much work, it itself could also become idle.
In either case, the result would eventually intensify the communication needed
to perform later load balancing operations. Many approaches are available for
this rule. One simple approach is to transfer a constant number of work units
(boxes) upon receiving a request, such as in a work stealing strategy (e.g., [68]).
A more sophisticated approach is to adopt a diffusive propagation strategy (e.g.,
[69-71]), which takes into account the workload states on both sides and adjusts
the workload dynamically with a mechanism analogous to heat or mass diffusion.

In  addition  to  the  quantity  of  workload,  as  measured  by  the  work  index, 
the  “quality”  of  transferred  boxes  is  also  an  important  issue.  In  this  context,  a 
work-selection  rule  is  applied  to  select  the  most  suitable  boxes  to  transmit  in 
order  to  supply  adequate  work  to  the  requesting  processor,  and  thus  reduce  the 
demands  for  further  load  balancing  operations  later.  Although  it  is  difficult  to 
precisely  estimate  the  size  of  the  tree  (or  total  work)  rooted  at  an  unexamined 
node  (box),  many  heuristic  rules  have  been  proposed  to  select  the  appropriate 
boxes.  One  rule-of-thumb  is  to  transmit  boxes  near  the  initial  root  of  the  overall 
binary  tree,  because  these  boxes  tend  to  have  more  future  work  associated  with 
the  subsequent  tree  rooted  at  them  (e.g.,  [72]).  While  this  has  been  demonstrated 
to  be  a  good  selection  rule  in  many  tree  search  applications,  this  and  other  such 
selection  rules  will  not  necessarily  have  a  strong  influence  on  the  performance  of 
a  parallel  BP  algorithm  applied  to  solve  equation-solving  problems  using  interval 
analysis.  However,  the  selection  rule  used  can  have  a  strong  impact  on  a  parallel 
BB  algorithm  when  solving  global  minimization  problems,  since  by  affecting  the 
evaluation  sequence  of  boxes  it  in  turn  affects  the  time  at  which  good  upper 
bounds  on  the  global  minimum  are  identified.  In  general,  the  earlier  a  good 
upper  bound  on  the  global  minimum  can  be  found,  the  less  work  that  needs 
to  be  done  to  complete  the  global  minimization,  since  this  means  it  is  more 
likely  that  boxes  can  be  pruned  using  an  objective  range  test.  This  issue  will  be 
addressed  in  more  detail  in  a  later  section. 


5.5  Global  Termination 

Parallel computation will be terminated when the globally optimal solution for
BB problems, or all feasible solutions for BP problems, have been found over the
entire binary tree, making all processors idle. For a synchronous parallel
algorithm, global termination can be easily detected through global communication
or periodic state information exchange. However, detecting the global termination
stage is a more difficult task for an asynchronous distributed algorithm, not
only because of the lack of global or centralized control, but also because there
is a need to guarantee that upon termination no unexamined workload remains
in the communication network due to message passing. One commonly used approach
that provides a reliable and robust solution to this problem is Dijkstra's
token termination detection algorithm [53, 73, 74].

6  Implementation  of  Dynamic  Load  Balancing  Algorithms


In this section, a sequence of three algorithms is described for load balancing in a
binary tree, with each algorithm in the sequence representing an improvement in
principle over the previous one. The last method represents a combination of the
most attractive and effective strategies adapted from previous research studies,
and also incorporates some novel strategies in this context. Interprocessor
communication is performed using the MPI protocol [75, 76], a very powerful and
popular technique for message passing operations that provides various
communication functions as discussed below. In the subsequent section, the
performance of the three algorithms described will be compared.

6.1  Synchronous  Work  Stealing  (SWS) 

This first workload balancing algorithm applies a global strategy, and is
illustrated in Fig. 4. All processors are synchronized in the interleaving
computation and communication phases. Synchronous blocking all-to-all communication
is used to periodically (after some number of tests on boxes) update the global
workload state information. Then, every idle processor, if there are any, "steals"
one unit of workload (one box) from the processor with the heaviest workload
(the largest number of stack boxes), applying a receiver-initiate scheme. As the
responsibility for the workload placement decision is given to each individual
processor, rather than to a centrally controlling manager processor, but global
communication is maintained, SWS can be regarded as a type of distributed
manager/worker scheme.

Fig. 4. The SWS algorithm uses global all-to-all communication to synchronize
computation and communication phases.

The global, all-to-all communication used in this approach provides for an
easy determination of workload dynamics, and may lead to a good global load
balancing. However, like the centralized manager/worker scheme, this convenience
also comes at the expense of increased communication cost when using
many processors. Such costs may result in intolerable communication overhead
and degradation of overall performance (speedup). It should also be noted that
the synchronous and blocking properties of the communication scheme may cause
idle states in addition to those that might arise due to an out-of-work condition.
When using the synchronous scheme, a processor (sender) that has reached the
synchronization point and is ready for communication needs to stay idle and
wait for another processor (receiver) to reach the same status, and then initiate
the communication together. Additional waiting states may occur due to the use
of blocking communication, since a message-passing operation may not complete
and return control to the sending processor until the data has been moved to
the receiving processor and a receive posted. Thus, the main difficulties with the
SWS approach are the communication overhead and the likely occurrence of idle
states, which together may result in poor scalability. However, one advantage to
this approach is that the global communication makes it easy to detect global
termination.
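One stealing round of this kind can be mimicked on a list of stack lengths, as in the following sketch. The function `sws_round` is hypothetical, and two details are assumptions not stated in the text: the heaviest processor is re-evaluated after each steal, and a donor is never left with an empty stack. No actual message passing is performed.

```python
def sws_round(loads):
    """One synchronous work-stealing round on a list of stack lengths."""
    loads = list(loads)  # work on a copy of the global work index vector
    for p in range(len(loads)):
        if loads[p] != 0:
            continue                       # only idle processors steal
        # Processor with the largest number of stack boxes.
        donor = max(range(len(loads)), key=loads.__getitem__)
        if loads[donor] > 1:               # assumption: donor keeps one box
            loads[donor] -= 1
            loads[p] += 1                  # the idle processor steals one box
    return loads


print(sws_round([0, 7, 0, 3]))
```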


6.2  Synchronous  Diffusive  Load  Balancing  (SDLB) 

This second approach for workload balancing follows a localized strategy, by
using local, point-to-point communication and a local cooperation strategy in
which load balancing operations are limited to a local domain of cooperating
processors, i.e., a group of "nearest neighbors" on some predefined virtual network.
A diffusive work-adjusting rule is also applied here to dynamically coordinate
workload transmission between processors, thereby achieving a workload balance
with a mechanism analogous to heat or mass diffusion, as illustrated in Fig. 5.

Fig. 5. SDLB uses a diffusive work-adjusting scheme to share workload among
neighbors in the virtual network. It is synchronous like SWS.

Instead of using global communication, point-to-point synchronous blocking
communication is used to exchange workload state information among cooperating
(neighbor) processors. The gathered information allows a given processor
to construct its own work index vector indicating the workload distribution in
its neighborhood. Then, the algorithm uses a symmetric initiation scheme to
cause the workload (boxes) to "diffuse" from processors with relatively heavy
workloads to processors with relatively light workloads, in order to maintain a
roughly equivalent workload over all processors. The virtual network used
initially here is simply a ring, which gives each processor two nearest neighbors.
Each local processor, i, adjusts its local workload with a neighbor, j, according
to the rule

u(j) = C[workflg(i) - workflg(j)],

where u(j) is the workload-adjusting index, C is a "diffusion coefficient" and
workflg is the work index vector. If u(j) is positive and/or greater than a
threshold, the local processor sends out workload (boxes); if u(j) is negative
and/or less than a threshold, the local processor receives workload (boxes). The
diffusion coefficient, C, is a heuristic parameter determining what fraction of
local work to offload, and is set at 0.5 in our applications. This diffusive scheme
has two advantages. First, when applied at an appropriate frequency, it
provides some certainty in preventing the appearance of out-of-work idle states.
Also, compacting multiple units of workload (boxes) together for transmission
enlarges the virtual grain of the transmitted messages. The use of coarse-grained
messages to reduce communication frequency tends to minimize the effect of high
latency in network transmission, especially on Ethernet. For example, less total
time is wasted in startup time of transmission, thus lowering the average
transmission cost of a work unit (box), as well as the ratio of communication time
to computation time. It should be noted that in considering message grain there
may also be maximum message size considerations.
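The work-adjusting rule u(j) = C[workflg(i) - workflg(j)] can be tried out on a ring of stack lengths, as in the sketch below. The function `diffusion_sweep` and the threshold of one box are assumptions for illustration. The sketch visits the ring edges one after another in a single process, whereas in the parallel algorithm each processor applies the rule with its own neighbors through message passing.

```python
def diffusion_sweep(workflg, C=0.5, threshold=1.0):
    """Apply the diffusive rule once to every edge (i, i+1) of a ring."""
    loads = list(workflg)
    n = len(loads)
    for i in range(n):
        j = (i + 1) % n                    # right-hand neighbor on the ring
        u = C * (loads[i] - loads[j])      # workload-adjusting index u(j)
        if u >= threshold:
            k = int(u)                     # processor i sends k boxes to j
            loads[i] -= k
            loads[j] += k
        elif u <= -threshold:
            k = int(-u)                    # processor i receives k boxes from j
            loads[i] += k
            loads[j] -= k
    return loads


print(diffusion_sweep([16, 0, 0, 0]))  # -> [5, 4, 2, 5]
```

With C = 0.5, each pairwise exchange moves half of the difference, so two neighbors end up with nearly equal stack lengths, and several boxes travel together in one transfer.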

Though the use of a local communication scheme will reduce communication
cost to some extent, the continued use of synchronous and blocking communication
operations is still an obstacle to achieving good scalability. On the other
hand, while using local rather than global communication makes the detection of
global termination less efficient, the synchronous and blocking properties make
this relatively straightforward. Since the problem of detecting global
termination becomes more difficult as the number of processors grows, this is
another important issue in scalability.

6.3  Asynchronous  Diffusive  Load  Balancing  (ADLB) 

In  this  third  load  balancing  approach,  a  local  communication  strategy  and  dif¬ 
fusive  work-adjusting  scheme  are  used,  as  in  SDLB.  However,  a  major  difference 
here  is  the  use  of  an  asynchronous  nonblocking  communication  scheme,  one  of 
the  key  capabilities  of  MPI.  The  combination  of  asynchronous  communication 
functionality  and  nonblocking,  persistent  communication  functionality  not  only 
provides  for  cheaper  communication  operations  by  eliminating  communication 
idle  states,  but  also,  by  breaking  process  synchronization,  makes  the  sequence  of 
events in the load balancing scheme flexible by allowing overlap of communication
and computation. As illustrated in Fig. 6, when each processor can perform
communication arbitrarily at any time, and independently of a cooperating
processor, all communication operations can be scattered among computation, with
less time consumed in message passing.

[Figure 6: timeline of processor Pi alternating computation (Comp.) and communication (Comm.) phases in a flexible sequence, sending out and receiving state information.]

Fig. 6. ADLB uses an asynchronous, nonblocking communication scheme, providing
more flexibility to each processor and overlapping communication and computation
phases.


In  addition  to  the  cheaper  and  more  flexible  communication  scheme,  we  in¬ 
corporate  into  the  ADLB  approach  two  new  strategies  to  try  to  reduce  the 
demand  for  communication  and  thereby  try  to  achieve  a  higher  overall  perfor¬ 
mance.  First,  as  noted  above,  in  BP  and  BB  methods,  it  is  not  really  necessary 
to  maintain  a  completely  balanced  workload  across  processors.  The  actual  goal 
is to prevent the occurrence of idle states by simply maintaining a workload to
each  processor  sufficiently  large  to  keep  it  busy  with  computation.  To  achieve 
balanced  workloads  may  require  a  very  large  number  of  workload  transmissions, 
resulting  in  a  heavy  communication  burden.  However,  in  this  case,  many  of  the 
workload  transmissions  may  be  unnecessary,  since  in  BP  and  BB  each  proces¬ 
sor  deals  with  its  stack  one  work  unit  (box)  at  a  time  sequentially,  leaving  all 
other  workload  simply  standing  by.  For  a  processor  to  avoid  an  idle  state,  and 
thus  have  a  high  efficiency  in  computation,  it  is  not  necessary  that  its  workload 
be  balanced  with  other  processors,  but  only  that  it  be  able  to  obtain  additional 
workload  from  another  processor  through  communication  as  it  is  approaching  an 
out-of-work state. Thus, we use here a receiver-initiated scheme that initiates work
transfer only when the number of boxes in a processor's stack is lower than some
threshold, which should be set high enough that the processor is not likely to
complete the work and become idle while its workload request to its neighboring
processors is being processed.

As  a  consequence,  we  can  also  implement  a  second  strategy,  which  eliminates 
the  periodic  state  information  exchange  and  combines  the  load  state  information 
of  the  requesting  processor  with  the  workload  request  message  to  the  donor  pro¬ 
cessor.  Upon  receiving  the  request,  the  donor  follows  a  diffusive  work-adjusting 
scheme  as  described  above  for  the  SDLB  approach,  but  with  a  modification  in 
the  response  to  the  workload  adjusting  index.  Here,  if  u(j)  is  positive  and/or 
greater  than  a  threshold,  the  donor  sends  out  workload  (boxes)  to  the  requesting 
processor;  otherwise,  it  responds  that  there  is  no  extra  workload  available.  Thus, 
when approaching idle, a processor sends out a request for work to all its cooperating
neighbors, and waits for any processor's donation of work. In case of no
work  being  transferred,  it  means  that  the  neighbor  processors  are  also  starved 
for  work  and  are  making  work  requests  to  other  neighbors.  In  this  case,  the  pro¬ 
cessor  will  keep  requesting  work  from  the  same  neighbors  until  they  eventually 
obtain  extra  work  from  remote  processors  and  are  able  to  donate  parts  of  it. 
Through  such  a  diffusive  mechanism,  heavily  loaded  processors  can  propagate 
workload  to  lightly  loaded  processors  with  a  small  communication  expense. 
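
The receiver-initiated request and the donor's response can be sketched as follows. This is a minimal illustration, not the implementation used in the tests: the function names are hypothetical, and the form of the workload-adjusting index, u = C * (donor load - requester load), is an assumption made so that the example is self-contained.

```python
C = 0.5  # diffusion coefficient; the text reports using 0.5


def needs_work(stack_size, low_water_mark):
    """Receiver side: request work once the stack drops below the threshold."""
    return stack_size < low_water_mark


def donor_response(donor_load, requester_load, c=C, threshold=0.0):
    """Donor side: number of boxes to send, or 0 for 'no extra workload'.

    ASSUMPTION: the workload-adjusting index is taken here as
    u = c * (donor_load - requester_load); the requester's load arrives
    together with the request message, so no periodic exchange is needed.
    """
    u = c * (donor_load - requester_load)
    # Only whole boxes can be sent, so a small positive u may still give 0.
    return int(u) if u > threshold else 0


if __name__ == "__main__":
    if needs_work(stack_size=1, low_water_mark=3):
        print(donor_response(donor_load=10, requester_load=1))
```

A requester that receives 0 from every neighbor simply repeats the request later, which matches the behavior described above for neighbors that are themselves starved for work.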

The  last  step  of  this  load  balancing  procedure  is  to  detect  global  termination. 
Because  the  ADLB  scheme  is  asynchronous,  the  detection  of  global  termination 
is  a  more  complex  issue  than  in  the  synchronous  case.  As  noted  above,  a  popular 
and  effective  technique  for  dealing  with  this  issue  is  Dijkstra’s  token  algorithm 
[53, 73,  74].  This  is  the  technique  used  in  the  ADLB  scheme. 
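
The idea behind token-based termination detection can be illustrated with a small simulation. This sketch follows the commonly described white/black token rules on a ring and assumes that work transfers are instantaneous; it is not the MPI implementation used in ADLB, and the donation rule inside the loop is a hypothetical stand-in for real load balancing.

```python
WHITE, BLACK = 0, 1


def simulate(initial_work, max_steps=10_000):
    """Run a toy ring computation until process 0 detects termination.

    Returns (step, work) where work is the final per-process load.
    """
    n = len(initial_work)
    work = list(initial_work)
    color = [WHITE] * n
    holder, token, probing = 0, WHITE, False
    for step in range(max_steps):
        # Computation phase: each process may donate work, then does one unit.
        for i in range(n):
            j = (i + 1) % n
            if work[i] >= 4 and work[j] == 0:  # hypothetical donation rule
                share = work[i] // 2
                work[i] -= share
                work[j] += share
                color[i] = BLACK  # a process that sends work turns black
            if work[i] > 0:
                work[i] -= 1
        # Token phase: the token moves only when its holder is passive.
        if work[holder] == 0:
            if holder == 0:
                if probing and token == WHITE and color[0] == WHITE:
                    return step, work  # global termination detected
                # Start (or restart) a probe with a white token.
                token, color[0], probing = WHITE, WHITE, True
                holder = n - 1
            else:
                if color[holder] == BLACK:
                    token = BLACK  # a black process taints the token
                color[holder] = WHITE
                holder -= 1
    raise RuntimeError("termination not detected within max_steps")


if __name__ == "__main__":
    print(simulate([8, 0, 0, 0]))
```

A black token returning to process 0 only means that the probe was inconclusive, so a new probe is started; termination is declared only after a full round in which no process sent work.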

In  the  next  section,  we  describe  tests  of  the  three  approaches  outlined  above 
for  load  balancing  in  parallel  BP  and  BB. 


7  Computational  Experiments  and  Results 

7.1  Test  Environment 

The performance of an algorithm on a parallel computing system is not only
dependent on the problem characteristics and the number of processors but also
on how processors interact with each other, as determined both by a physical
architecture in hardware and a virtual architecture in software. The physical
architecture used in these tests, as illustrated in Fig. 7, is a network-based system,
comprising 16 Sun Ultra 1/140e workstations, physically connected by switched
Ethernet.  As  noted  above,  in  comparison  to  mainframe  systems,  such  a  cluster 
of  workstations  (COW)  has  advantages  in  its  relatively  low  expense  and  easy 
availability  of  hardware.  However,  depending  on  the  communication  bandwidth 
and  on  the  communication  demands  of  the  algorithm  being  executed,  network 
contention  can  have  a  serious  impact  on  the  performance  of  such  a  system,  par¬ 
ticularly  if  the  number  of  processors  is  large. 

[Figure 7: row of workstation nodes (labeled M, $ and P) attached to a switched Ethernet.]

Fig. 7. Physical hardware used is a cluster of workstations connected by switched
Ethernet.


Two  types  of  virtual  network  are  used:  an  all-to-all  network  (Fig.  8(a))  in 
the  case  of  SWS,  and  a  one-dimensional  torus  (ring)  network  (Fig.  8(b))  in  the 
cases  of  SDLB  and  ADLB.  In  the  SWS  algorithm,  the  all-to-all  network  is  im¬ 
plemented  by  the  use  of  global,  all-to-all  communication.  However,  in  the  SDLB 
and  ADLB  algorithms,  in  order  to  reduce  communication  demands  and  alleviate 
potential  network  contention,  we  only  use  point-to-point  local  communication 
functions  and  implement  the  ring  network.  The  load  balancing  algorithms  and 
test  problems  were  implemented  in  FORTRAN-77  using  the  MPI  protocol  [75, 
76]  for  interprocessor  communication. 


7.2  Test  Problem 

The  test  problem  used  is  a  global  nonlinear  parameter  estimation  problem  involv¬ 
ing  a  vapor-liquid  equilibrium  (VLE)  model  (Wilson’s  equation).  Such  models, 
and  the  estimation  of  parameters  in  them,  are  important  in  chemical  process  en¬ 
gineering,  since  they  are  the  basis  for  the  design,  simulation  and  optimization  of 
widely-used  separation  processes  such  as  distillation  [48].  In  this  particular  prob¬ 
lem,  we  use  as  the  objective  function  the  maximum  likelihood  estimator,  with 
two  unknown  standard  deviations,  to  determine  two  model  parameters  giving 
the  globally  optimal  fit  of  the  data  to  the  model  [77].  In  addition  to  the  difficult 
nonlinear  objective  function,  the  problem  data  and  characteristics  were  chosen  to 
make  this  a  particularly  difficult  problem.  Interval  analysis,  as  described  above, 
is used to guarantee the correct global solution. The problem can be solved in
either of two ways. One approach is to treat it as a nonlinear equation solving
problem, and use the parallel interval BP algorithm to solve for all stationary
points of the objective function (there are five stationary points in this problem).
The alternative approach is to treat it directly as a global optimization problem
and use the parallel interval BB algorithm. The major difference between the
two approaches is the use of the objective range test in the BB algorithm.

Fig. 8. Virtual network in load balancing: (a) all-to-all network using global
communication; used for SWS; (b) 1-D torus network using local communication;
used for SDLB and ADLB.

7.3  Computational  Results 

This  parameter  estimation  problem  was  solved  using  the  COW  system  described 
above.  During  the  computational  experiments,  the  COW  was  dedicated  exclu¬ 
sively  to  solving  this  problem;  that  is,  there  were  no  other  users  either  on  the 
workstations  or  on  the  network.  Both  the  BP  scheme  solving  for  all  stationary 
points  and  the  BB  scheme  merely  searching  for  the  global  optimum  were  ex¬ 
ecuted  on  up  to  16  processors  using  each  of  the  three  load  balancing  schemes 
described  above.  Both  sequential  and  parallel  execution  times  were  measured 
in  terms  of  the  MPI  wall  time  function,  and  the  performance  of  each  approach 
evaluated  in  terms  of  parallel  speedup  (ratio  of  the  sequential  execution  time  to 
the  parallel  execution  time)  and  parallel  efficiency  (ratio  of  the  parallel  speedup 
to  the  number  of  processors  used). 
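
The two performance measures defined above reduce to the following. This is a minimal sketch: the function names and the timing values in the example are hypothetical and are not results from these experiments.

```python
def speedup(t_sequential, t_parallel):
    """Parallel speedup: sequential wall time divided by parallel wall time."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_processors):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup(t_sequential, t_parallel) / n_processors


if __name__ == "__main__":
    # Hypothetical wall times in seconds, for illustration only.
    print(speedup(152.0, 10.0), efficiency(152.0, 10.0, 16))
```

A speedup above the number of processors gives an efficiency above 1, which is how superlinear speedup shows up in this measure.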

For the interval BP problem of finding all stationary points, the speedups
obtained using the three load balancing algorithms, i.e. SWS, SDLB and ADLB,
on various numbers of processors are shown in Fig. 9. All five stationary points
were found in every experiment. All points in Fig. 9 are based on an average
over several runs. Since both the sequential runs and all parallel BP runs
explored the same binary tree and treated an equivalent amount of total work,
the  computational  results  are  repeatable  and  consistent  with  negligible  devia¬ 
tions.  As  expected,  the  ADLB  approach  clearly  outperforms  SWS  and  SDLB, 
exhibiting  only  slightly  sublinear  speedup.  This  can  also  be  seen  in  the  parallel 
efficiency  curves,  as  shown  in  Fig.  10.  While  efficiency  curves  tend  to  decrease  as 
the number of processors increases, as a consequence of Amdahl's law, the
ADLB  procedure  maintains  a  high  efficiency  of  around  95%.  Thus,  with  the  only 
slightly  sublinear  speedup  and  the  very  high  efficiency  on  up  to  16  processors,  it 
seems  likely  that  the  ADLB  algorithm  will  be  highly  scalable  to  larger  numbers 
of  processors. 


Fig.  9.  Comparison  of  load  balancing  algorithms  on  equation  solving  problem:  speedup 
vs.  number  of  processors. 


SWS  exhibits  the  poorest  performance  of  the  three  load  balancing  methods. 
This  is  partly  due  to  a  poor  global  workload  distribution,  resulting  in  a  rela¬ 
tively  large  number  of  out-of-work  idle  states,  and  also  partly  due  to  the  com¬ 
munication  overhead  from  using  the  global  synchronous  blocking  communication 
scheme.  In  SDLB,  the  symmetric  diffusive  work-adjusting  scheme  using  the  local 
communication scheme substantially reduces out-of-work idle states by achieving
an even load balance and thus improving the speedup and efficiency. However,
while a local communication scheme is employed, the synchronous blocking
communication functions used retain a high communication cost and represent a
scaling bottleneck. This issue is addressed in ADLB by using asynchronous
nonblocking communication functions, allowing the overlap of communication and
computation. In addition, by working towards a goal of maintaining non-empty
local work stacks instead of an evenly balanced global workload distribution,
ADLB provides a large reduction in network communication requirements, thus
greatly reducing communication bottlenecks. The reduction of such bottlenecks
in ADLB allows it to achieve a consistently high, nearly linear speedup.

Fig. 10. Comparison of load balancing algorithms on equation solving problem:
efficiency vs. number of processors.

For  solving  the  parameter  estimation  problem  as  a  global  optimization  prob¬ 
lem  with  parallel  interval  BB,  only  the  best  load  balancing  scheme,  ADLB,  was 
employed.  Three  different  runs  using  the  same  problem  were  made  at  two,  four, 
eight  and  16  processors.  The  resulting  speedups  are  shown  in  Fig.  11.  We  first 
observe  that  all  speedups  are  above  the  linear  speedup  line,  with  a  speedup 
over  oO  on  16  processors  in  one  case.  Superlinear  speedup  is  possible  because 
of  the  broadcast  of  least  upper  bounds,  which  may  cause  tree  nodes  (boxes)  to 
be  discarded  earlier  than  in  the  sequential  case,  i.e.  there  is  less  work  to  do  in 
the  parallel  case  than  in  the  sequential  case.  Also,  the  speedups  are  not  exactly 
repeatable  and  may  vary  significantly  from  run  to  run.  This  occurs  because  of 
slightly  different  timing  in  finding  and  broadcasting  improved  upper  bounds  in 
each  run.  Speedup  anomalies,  such  as  the  superlinear  speedups  seen  here,  are  not 
uncommon  in  parallel  BB  search,  provided  the  reduction  in  the  work  required 
in  the  parallel  case  (which  usually  happens  but  not  always)  is  not  outweighed 
by  communication  expenses  or  other  overhead  in  the  parallel  computation. 

Fig. 11. Speedup anomaly and superlinear speedups are observed in solving the global
optimization  problem  using  the  parallel  BB  algorithm  based  on  ADLB. 


8  Discussion 

The  excellent  performance  of  ADLB  on  the  tests  described  above  provides  mo¬ 
tivation  for  further  improving  the  ADLB  approach  for  execution  on  even  larger 
numbers  of  processors  and  applied  to  different  sizes  of  problems.  One  factor  we 
have  investigated  is  the  effect  of  the  underlying  virtual  network,  which  is  defined 
to  locally  coordinate  neighbor  processors  in  workload  distribution  and  message 
propagation.  Instead  of  using  a  1-D  torus  (ring)  virtual  network,  a  two  dimen¬ 
sional  (2-D)  torus  virtual  network,  as  shown  in  Fig.  12,  has  been  considered  to 
enhance  the  load  balancing  performance.  When  compared  to  the  1-D  torus,  a 
2-D  torus  has  a  higher  communication  overhead  due  to  each  processor  having 
more neighbors, but it also has a smaller network diameter, $2\lfloor \sqrt{P}/2 \rfloor$ vs. $\lfloor P/2 \rfloor$,
thus decreasing the message diffusion distance. It is expected that the trade-off
between communication overhead and message diffusion distance may favor the
2-D  torus  for  a  larger  number  of  processors. 
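
The difference in network diameter can be checked directly by breadth-first search. The sketch below is illustrative only: the helper names are hypothetical, and the 2-D torus is assumed to be square, with side length equal to the square root of the number of processors.

```python
from collections import deque


def ring_neighbors(p):
    """Adjacency of a 1-D torus (ring) with p processors."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


def torus2d_neighbors(side):
    """Adjacency of a square 2-D torus with side * side processors."""
    nbrs = {}
    for r in range(side):
        for c in range(side):
            nbrs[(r, c)] = [((r - 1) % side, c), ((r + 1) % side, c),
                            (r, (c - 1) % side), (r, (c + 1) % side)]
    return nbrs


def diameter(nbrs):
    """Largest shortest-path distance between any two nodes, by BFS."""
    best = 0
    for src in nbrs:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in nbrs[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best


if __name__ == "__main__":
    for side in (4, 8):
        p = side * side
        print(p, diameter(ring_neighbors(p)), diameter(torus2d_neighbors(side)))
```

For 16 processors the ring has diameter 8 and the 4 x 4 torus has diameter 4; for 64 processors the values are 32 and 8, so the gap in diffusion distance widens as the processor count grows.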

To  evaluate  broadly  the  performance  of  different  parallel  algorithms,  it  is 
useful  to  carry  out  a  scalability  analysis,  which  examines  how  well  an  algorithm 
maintains  a  constant  efficiency  as  the  problem  size  and  the  number  of  processors 
increase.  Thus,  we  carried  out  an  experiment  based  on  the  isoefficiency  function 
[53],  which  determines  how  much  problem  size  needs  to  increase  in  proportion 
to  the  number  of  processors  in  order  to  keep  the  efficiency  at  a  constant  level. 
Small  values  of  the  isoefficiency  function  will  correspond  to  better  scalability. 

Fig. 12. 2-D torus virtual network is implemented in ADLB to achieve high scalability
when running over larger numbers of processors.


We  have  done  preliminary  experiments,  performing  isoefficiency  analysis  with 
up  to  64  processors,  which  demonstrate  the  better  scalability  of  the  2-D  torus 
virtual  network  on  parallel  BB  and  BP  problems. 

Another  issue  of  interest  in  this  context  is  how  to  improve  the  search  efficiency 
of  interval  BB  for  the  global  optimum.  As  noted  above,  there  are  priority  list 
schemes,  such  as  prioritizing  the  stack  based  on  a  lower  bound  value,  that  have 
been  demonstrated  to  be  useful  in  a  variety  of  branch  and  bound  problems.  A 
difficulty  with  using  lower  bound  values  is  that  these  may  not  be  sufficiently  tight 
to  provide  any  useful  heuristic  ordering  for  the  evaluation  of  stack  boxes.  This 
is  particularly  true  if  the  lower  bound  is  obtained  by  simple  interval  arithmetic, 
which  often  provides  only  loose  bounds  when  applied  to  a  complicated  function. 

Thus,  we  have  developed  another  approach  aimed  at  scheduling  the  stack 
boxes  for  processing.  This  is  a  novel  dual  stack  management  scheme  in  which 
each  processor  maintains  two  stacks,  a  global  stack  and  a  local  stack.  The  local 
stack  is  unprioritized;  that  is,  with  workload  appearing  in  the  same  sequence  as 
it  is  generated  in  the  IN/GB  algorithm.  The  local  processor  draws  its  work  from 
the  local  stack  as  long  as  it  is  not  empty.  This  contributes  a  depth-first  pattern 
to  the  overall  tree  search  process.  The  global  stack  is  also  unprioritized,  and 
is  created  by  randomly  removing  boxes  from  the  local  stack.  The  global  stack 
provides  boxes  for  workload  transmission  to  other  processors.  This  contributes 
breadth  to  the  tree  search  process.  This  dual  stack  management  scheme  has  been 
demonstrated  to  be  capable  of  producing  consistently  high  superlinear  speedups 
in  BB,  reducing  the  variations  from  run  to  run  observed  previously  [78]. 
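
The dual stack idea can be sketched as a small data structure. This is only an illustration of the description above: the class and method names are hypothetical, and two details the text does not specify are filled in as labeled assumptions (how many boxes are moved to the global stack at a time, and falling back to the global stack when the local stack is empty).

```python
import random


class DualStack:
    """Per-processor work store with a local stack and a global stack."""

    def __init__(self, seed=None):
        self.local = []    # unprioritized, boxes kept in generation order
        self.global_ = []  # unprioritized, boxes available for donation
        self._rng = random.Random(seed)

    def push(self, box):
        """Store a newly generated box on the local stack."""
        self.local.append(box)

    def next_box(self):
        """Draw work from the local stack (last in, first out: depth-first)."""
        if self.local:
            return self.local.pop()
        # ASSUMPTION: fall back to the global stack once local work runs out.
        return self.global_.pop() if self.global_ else None

    def move_to_global(self, count):
        """Move `count` randomly chosen boxes from the local to the global stack.

        ASSUMPTION: the caller decides how many boxes to move and when.
        """
        for _ in range(min(count, len(self.local))):
            idx = self._rng.randrange(len(self.local))
            self.global_.append(self.local.pop(idx))

    def donate(self, count):
        """Remove up to `count` boxes from the global stack for transmission."""
        count = min(count, len(self.global_))
        donated = self.global_[:count]
        del self.global_[:count]
        return donated
```

Drawing locally from the top of the stack keeps the search depth-first, while the randomly selected boxes that are sent to other processors spread the search across different parts of the tree.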


9  Concluding  Remarks

We  have  described  how  load  management  strategies  can  be  used  for  effectively 
solving  interval  BB  and  BP  problems  in  parallel  on  a  network-based  system.  Of 
the  dynamic  load  balancing  algorithms  considered,  the  best  performance  was 
achieved  by  the  asynchronous  diffusive  load  balancing  (ADLB)  approach.  This 
overlaps communication and computation by the use of the asynchronous
nonblocking communication functions provided by MPI, and uses a type of diffusive
load-adjusting scheme to prevent out-of-work idle states while keeping
communication needs small.

The  ADLB  algorithm  was  applied  in  connection  with  interval  analysis,  in 
particular  with  an  interval-Newton/generalized  bisection  (IN/GB)  procedure  for 
reliable  nonlinear  equation  solving  and  deterministic  global  optimization.  IN/GB 
provides  the  capability  to  find  (enclose)  all  solutions  in  a  nonlinear  equation  solv¬ 
ing  problem  with  mathematical  and  computational  certainty,  or  the  capability 
to  solve  global  optimization  problems  with  complete  certainty.  The  results  of 
applying  ADLB  in  the  equation  solving  context  have  shown  that  the  parallel  BP 
algorithm  can  achieve  a  nearly  linear  speedup  with  a  consistently  high  efficiency 
around  95%  on  up  to  16  processors  in  a  one-dimensional  torus  virtual  network. 
Preliminary  indications  are  that  ADLB  provides  high  scalability  up  to  64  pro¬ 
cessors,  and  different  sizes  of  problems,  when  using  a  2-D  torus  virtual  network. 
In the context of global optimization, the parallel BB algorithm achieves significantly
superlinear speedups, though it is somewhat inconsistent in the extent to
which  this  occurs.  By  implementing  a  new  dual  stack  management  scheme  in 
connection  with  ADLB  it  appears  that  a  consistently  high  superlinear  speedup 
on  optimization  problems  can  be  obtained. 

Though  the  test  problem  here  was  based  on  a  global  parameter  estimation 
problem,  it  should  be  emphasized  that  the  parallel  IN/GB  method  is  general- 
purpose  and  can  be  used  in  connection  with  a  wide  variety  of  global  optimization 
problems  and  nonlinear  equation  solving  problems.  Also,  the  load  management 
schemes  described  can  be  applied  to  a  wide  variety  of  other  tree  search  prob¬ 
lems  in  chemical  process  engineering,  such  as  in  process  synthesis  and  process 
scheduling. 


Abstract.  A  parallel  implementation  of  the  specialized  interior-point
algorithm  for  multicommodity  network  flows  introduced  in  [5]  is  pre¬ 
sented.  In  this  algorithm,  the  positive  definite  systems  of  each  iteration 
are  solved  through  a  scheme  that  combines  direct  factorization  and  a 
preconditioned  conjugate  gradient  (PCG)  method.  Since  the  solution  of 
at  least  k  independent  linear  systems  is  required  at  each  iteration  of  the 
PCG, k being the number of commodities, a coarse-grained parallelization
of the algorithm naturally arises. Also, several other minor steps of
the  algorithm  are  easily  parallelized  by  commodity.  An  extensive  set  of 
computational  results  on  a  shared  memory  machine  is  presented,  using 
problems  of  up  to  2.5  million  variables  and  260,000  constraints.  The  re¬ 
sults  show  that  the  approach  is  especially  competitive  on  large,  difficult 
multicommodity  flow  problems. 


1  Introduction 

Multicommodity  flows  are  among  the  most  challenging  linear  problems,  due 
to  the  large  size  of  these  models  in  real  world  applications  (e.g.,  routing  in 
telecommunications  networks).  Indeed,  these  problems  have  been  used  to  test 
the  efficiency  of  early  interior-point  solvers  for  linear  programming  [1].  The  need 
to  solve  very  large  instances  has  led  to  the  development  of  both  specialized 
algorithms  and  parallel  implementations. 

In  this  paper,  we  present  a  parallel  implementation  of  a  specialized  interior- 
point  algorithm  for  multicommodity  flows  [5].  In  this  approach,  the  block-angular 
structure  of  the  coefficient  matrix  is  exploited  for  performing  in  parallel  the  solu¬ 
tion  of  small  linear  systems  related  to  the  different  commodities,  unlike  general- 
purpose  parallel  interior-point  codes  [2,8,17]  where  the  parallelization  effort  is 
focused  on  the  Cholesky  factorization  of  one  large  system.  This  has  already  been 
proposed [16, 9, 13]; however, all the previous approaches require computing and
factorizing the Schur complement. This can become a significant serial bottleneck,
since this matrix is usually prohibitively dense. Although this bottleneck can
be partly eluded by using parallel linear algebra routines, our approach takes
a more radical route by not forming the Schur complement at all, and using an
iterative method instead. There have been other proposals along these lines [22,
14], but limited to the sequential case; also, so far no results have been shown
for  these  algorithms.  The  implementation  presented  in  this  paper  significantly 
improves  on  the  preliminary  one  described  in  [6].  There,  only  some  of  the  major 
routines  were  parallelized,  and  less  attention  was  paid  to  communication  and 
data  distribution.  Working  on  these  details  allowed  us  to  obtain  new  and  better 
computational  results. 

From  the  multicommodity  point  of  view,  this  approach  differentiates  itself 
from most other parallel solvers [7, 15, 19, 25, 21, 12] in that it is not based on a
decomposition approach. The structure of the multicommodity flow problem has
led  to  a  number  of  specialized  algorithms,  most  of  which  share  the  idea  of  de¬ 
composing  in  some  way  the  problem  into  a  set  of  smaller  independent  problems. 
These  are  all  iterative  methods,  where  at  each  step  the  subproblems  are  solved, 
and  their  results  are  used  in  some  way  to  modify  the  subproblems  to  be  solved 
at  the  next  iteration.  Hence,  these  approaches  are  naturally  suited  for  coarse¬ 
grained  parallelization.  Parallel  price-directive  decomposition  approaches  have 
been  proposed  based  on  bundle  methods  [7,19],  analytic  center  methods  [12] 
or linear-quadratic penalty functions [21]. Parallel resource-directive approaches
are described in [15]. Finally, experiences with a parallel interior-point
decomposition method are presented in [25]. A discussion of these and other parallel
decomposition approaches can be found in [7]. A general description of the
parallelization of mathematical programming algorithms can be found in [3, 23].

The  paper  is  organized  as  follows.  Section  2  presents  the  formulation  of  the 
problem  to  be  solved.  Section  3  outlines  the  specialized  interior-point  algorithm 
for  multicommodity  flows  proposed  in  [5],  including  a  brief  description  of  the 
general  path-following  method.  Section  4  deals  with  the  parallelization  issues  of 
the algorithm. Finally, Section 5 presents and discusses the computational results.


2  Problem  Formulation 


The multicommodity flow problem requires finding the least-cost routing of a
set  of  k  commodities  through  a  network  of  m  nodes  and  n  arcs,  where  the  arcs 
have  an  individual  capacity  for  each  commodity,  and  a  mutual  capacity  for  all 
the  commodities.  The  node-arc  formulation  of  the  problem  is 


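$$
\begin{aligned}
\min\quad & \sum_{i=1}^{k} c^i x^i \\
\text{subject to}\quad & E x^i = b^i, \quad i = 1, \ldots, k, \\
& \sum_{i=1}^{k} x^i + x^0 = u, \\
& 0 \le x^i \le u^i, \quad i = 1, \ldots, k, \qquad 0 \le x^0 \le u.
\end{aligned}
\qquad (1)
$$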

Vectors $x^i \in \mathbb{R}^n$ are the flow arrays for each commodity, while $x^0 \in \mathbb{R}^n$ are the
slacks of the mutual capacity constraints. $E \in \mathbb{R}^{m \times n}$ is the node-arc incidence
matrix of the underlying directed graph, while $I$ denotes the $n \times n$ identity
matrix. We shall assume that $E$ is a full row-rank matrix: this can always be
guaranteed by removing any of the redundant node balance constraints. $c^i \in \mathbb{R}^n$
and $u^i \in \mathbb{R}^n$ are respectively the flow cost vector and the individual capacity
vector for commodity $i$, while $u \in \mathbb{R}^n$ is the vector of the mutual capacities.
Finally, $b^i \in \mathbb{R}^m$ is the vector of supplies/demands for commodity $i$ at the
nodes of the network.

The multicommodity flow problem is a linear program with $\bar{m} = km + n$
constraints and $\bar{n} = (k + 1)n$ variables. In some real-world models, k can
be  very  large:  for  instance,  in  many  telecommunication  problems  a  commodity 
represents  the  flow  of  data/voice  between  two  given  nodes  of  the  network,  and 
therefore k is $O(m^2)$. Thus, the resulting linear program can be huge even for
graphs  of  moderate  size.  However,  the  coefficient  matrix  of  the  problem  is  highly 
structured:  it  has  a  block-staircase  form,  each  block  being  a  node-arc  incidence 
matrix.  Several  methods  have  been  proposed  which  exploit  this  structure;  one  is 
the  specialized  interior-point  algorithm  to  be  described  in  the  next  paragraph. 
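
To give a feel for how quickly the linear program grows, the counts above can be evaluated for a sample network. This is an illustrative sketch only: the function names and the network sizes are hypothetical, and taking one commodity per ordered pair of nodes is an assumption used to mimic the telecommunication case.

```python
def lp_dimensions(m, n, k):
    """Return (constraints, variables) of the node-arc formulation.

    m: number of nodes, n: number of arcs, k: number of commodities.
    """
    constraints = k * m + n    # k blocks of node balances plus n mutual capacities
    variables = (k + 1) * n    # k flow vectors plus one slack vector
    return constraints, variables


def all_pairs_commodities(m):
    """ASSUMPTION: one commodity per ordered pair of distinct nodes."""
    return m * (m - 1)


if __name__ == "__main__":
    m, n = 100, 400  # hypothetical network size
    k = all_pairs_commodities(m)
    print(k, lp_dimensions(m, n, k))
```

With 100 nodes and 400 arcs, this assumption gives 9900 commodities, about 990,000 constraints and nearly four million variables, which shows why a moderate graph can already yield a very large problem.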

3  A  Specialized  Interior-Point  Algorithm 

In  [5],  a  specialized  interior-point  algorithm  for  multicommodity  flows  has  been 
presented  and  tested.  This  algorithm,  and  the  code  that  implements  it,  will  be 
referred  to  as  IPM. 

IPM  is  a  specialization  of  the  path-following  algorithm  for  linear  program¬ 
ming  [26].  Let  us  consider  the  following  linear  programming  problem  in  primal 
form 

$$\min \{\, cx : Ax = b,\; x + s = u,\; x, s \ge 0 \,\}, \qquad (2)$$

where $x \in \mathbb{R}^{\bar{n}}$ and $s \in \mathbb{R}^{\bar{n}}$ are respectively the primal variables and the slacks
of the box constraints, $u \in \mathbb{R}^{\bar{n}}$, $c \in \mathbb{R}^{\bar{n}}$ and $b \in \mathbb{R}^{\bar{m}}$ are respectively the upper
bounds, the cost vector and the right hand side vector, and $A \in \mathbb{R}^{\bar{m} \times \bar{n}}$ is a full
row-rank matrix. The dual of (2) is

$$\max \{\, yb - wu : yA + z - w = c,\; z, w \ge 0 \,\}, \qquad (3)$$

where $y \in \mathbb{R}^{\bar{m}}$, $z \in \mathbb{R}^{\bar{n}}$ and $w \in \mathbb{R}^{\bar{n}}$ are respectively the dual variables of the
structural constraints $Ax = b$, the dual slacks and the dual variables of the box
constraints $x \le u$.

Replacing  the  inequalities  in  (2)  by  a  logarithmic  barrier  in  the  objective 
function, with parameter $\mu$, the KKT optimality conditions of the resulting
problem are

$$
\begin{aligned}
r_{xz} &= \mu e - XZe = 0 \\
r_{sw} &= \mu e - SWe = 0 \\
r_b &= b - Ax = 0 \\
r_c &= c - (yA + z - w) = 0 \\
r_u &= u - x - s = 0 \\
(x, s, z, w) &\ge 0,
\end{aligned}
\qquad (4)
$$

where $e$ is the vector of 1's of proper dimension, and each uppercase letter
corresponds to the diagonal matrix having as diagonal elements the entries of
the corresponding lowercase vector. In the algorithm we impose $r_u = 0$, i.e.
$s = u - x$, thus eliminating $\bar{n}$ variables.
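
As an illustration, the residuals in (4) can be evaluated at any given point as follows. This is a minimal pure-Python sketch using the row-vector convention yA of the text; the function name and the tiny data set in the example are hypothetical.

```python
def kkt_residuals(A, b, c, u, x, s, y, z, w, mu):
    """Evaluate the five residual vectors of the barrier KKT system.

    A is a list of rows; all other arguments except mu are lists of floats.
    """
    m, n = len(A), len(c)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    yA = [sum(y[i] * A[i][j] for i in range(m)) for j in range(n)]
    return {
        "r_xz": [mu - x[j] * z[j] for j in range(n)],  # mu*e - XZe
        "r_sw": [mu - s[j] * w[j] for j in range(n)],  # mu*e - SWe
        "r_b": [b[i] - Ax[i] for i in range(m)],
        "r_c": [c[j] - (yA[j] + z[j] - w[j]) for j in range(n)],
        "r_u": [u[j] - x[j] - s[j] for j in range(n)],
    }


if __name__ == "__main__":
    # Hypothetical one-constraint, two-variable example.
    res = kkt_residuals(A=[[1.0, 1.0]], b=[1.0], c=[1.0, 2.0], u=[1.0, 1.0],
                        x=[0.5, 0.5], s=[0.5, 0.5], y=[0.0],
                        z=[0.5, 0.5], w=[0.5, 0.5], mu=0.25)
    print(res)
```

A point lies on the central path for a given barrier parameter exactly when all five residuals vanish and the variables are nonnegative; in the example only the dual residual is nonzero.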

The (unique) solutions of (4) for each possible $\mu > 0$ describe a continuous
trajectory, known as the central path, which, as $\mu$ tends to 0, converges to the
optimal solutions of (2) and (3). A path-following algorithm attempts to reach
close to these optimal solutions by following the central path. This is done by
performing a damped version of Newton's iteration applied to the nonlinear
system (4), as shown in (5). A more detailed description of the algorithm can be
found in many linear programming textbooks, e.g. [26].


(5) 


The main computational burden of the algorithm is the solution of the system

$$ (A \Theta A^T)\, \Delta y = r_b + A \Theta r = \bar b. \tag{6} $$

Note that $A \Theta A^T$ is symmetric and positive definite, as $\Theta$ is clearly a positive
definite diagonal matrix. Usually, interior-point codes solve (6) through a
Cholesky factorization, preceded by a permutation of the columns of $A$ aimed at
minimizing the fill-in effect. Several effective heuristics have been developed for
computing such a permutation. Unfortunately, when $A$ is the constraints matrix
of (1), the Cholesky factors of $A \Theta A^T$ turn out to be rather dense anyway [5].

However, the structure of $A$ can be used to solve (6) without computing
the factorization of $A \Theta A^T$. Note that $\Theta$ is partitioned into the $k$ blocks $\Theta^i$,
$i = 1 \ldots k$, one for each commodity, plus the block $\Theta^0$ corresponding to the
slack variables $x^0$ of the mutual capacity constraints. Hence,

$$ A \Theta A^T = \begin{bmatrix} B & C \\ C^T & D \end{bmatrix} \tag{7} $$

i.e., $B$ is the block diagonal matrix having the $m \times m$ matrices $B_i = E \Theta^i E^T$,
$i = 1 \ldots k$, as diagonal elements, and

$$ C^T = [\, C_1^T \ldots C_k^T \,] = [\, \Theta^1 E^T \ldots \Theta^k E^T \,]. $$

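Exploiting (7), and partitioning the vectors $\Delta y$ and $\bar b$ accordingly, the solution
of (6) is reduced to

$$ \left( D - \sum_{i=1}^{k} C_i^T B_i^{-1} C_i \right) \Delta y^0 = \bar b^0 - \sum_{i=1}^{k} C_i^T B_i^{-1} \bar b^i = \beta^0 \tag{8} $$

$$ B_i \Delta y^i = (\bar b^i - C_i \Delta y^0) = \beta^i, \quad i = 1 \ldots k. \tag{9} $$

The matrix

$$ H = D - C^T B^{-1} C = D - \sum_{i=1}^{k} C_i^T B_i^{-1} C_i \tag{10} $$

is known as the Schur complement.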
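The reduction from (6) to (8) and (9) can be checked numerically on a toy instance. The following is only a sketch under stated assumptions, not the IPM code: `E` is a random stand-in for the node-arc incidence matrix, `D` is assumed to be $\Theta^0 + \sum_i \Theta^i$ (its definition is not given in the text above), the names `schur_solve` and `max_error_vs_direct` are hypothetical, and $H$ is formed densely only because the instance is tiny.

```python
import numpy as np


def schur_solve(B_blocks, C_blocks, D, b_blocks, b0):
    """Solve [[B, C], [C^T, D]] [dy; dy0] = [b; b0] through (8) and (9)."""
    H = D.copy()
    beta0 = b0.copy()
    for B_i, C_i, b_i in zip(B_blocks, C_blocks, b_blocks):
        H -= C_i.T @ np.linalg.solve(B_i, C_i)      # H = D - sum_i C_i^T B_i^{-1} C_i
        beta0 -= C_i.T @ np.linalg.solve(B_i, b_i)  # right-hand side of (8)
    dy0 = np.linalg.solve(H, beta0)                 # system (8)
    dys = [np.linalg.solve(B_i, b_i - C_i @ dy0)    # the k independent systems (9)
           for B_i, C_i, b_i in zip(B_blocks, C_blocks, b_blocks)]
    return dys, dy0


def max_error_vs_direct(seed=0, m=4, n=7, k=3):
    """Compare the block solve with a direct solve of the assembled matrix."""
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((m, n))                 # assumption: stand-in for E
    thetas = [np.diag(rng.uniform(0.5, 2.0, n)) for _ in range(k)]
    theta0 = np.diag(rng.uniform(0.5, 2.0, n))
    B_blocks = [E @ T @ E.T for T in thetas]        # B_i = E Theta^i E^T
    C_blocks = [E @ T for T in thetas]              # C_i = E Theta^i
    D = theta0 + sum(thetas)                        # assumption: form of D
    size = k * m + n
    M = np.zeros((size, size))                      # assembled matrix of (7)
    for i in range(k):
        rows = slice(i * m, (i + 1) * m)
        M[rows, rows] = B_blocks[i]
        M[rows, k * m:] = C_blocks[i]
        M[k * m:, rows] = C_blocks[i].T
    M[k * m:, k * m:] = D
    rhs = rng.standard_normal(size)
    b_blocks = [rhs[i * m:(i + 1) * m] for i in range(k)]
    dys, dy0 = schur_solve(B_blocks, C_blocks, D, b_blocks, rhs[k * m:])
    block_solution = np.concatenate(dys + [dy0])
    return np.max(np.abs(block_solution - np.linalg.solve(M, rhs)))
```

Calling `max_error_vs_direct()` returns the largest componentwise difference between the two solutions, which should be at rounding-error level.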

Thus, (6) can be solved by means of (8), involving the Schur complement $H$,
followed by the $k$ subsystems (9) involving the matrices $B_i$. The latter step can
be easily parallelized. However, solving (8) with a direct method, as advocated in
[16,9], requires forming and factorizing $H$. As shown in [5], this matrix typically
becomes rather dense, hence such a direct approach may become computationally
too expensive. Furthermore, it represents a formidable serial bottleneck for
a parallel implementation of the code. As suggested in [16], this bottleneck can
be reduced by using parallel linear algebra routines [2, 8, 17]. However, it is also
possible to avoid forming $H$ at all, solving (8) by means of an iterative algorithm.

Since $H$ is symmetric and positive definite, a preconditioned conjugate gradient
(PCG) method can be used. In [5], a family of preconditioners is proposed,
based on the following characterization of the inverse of $H$:

$$ H^{-1} = \left( \sum_{i=0}^{\infty} (D^{-1} Q)^i \right) D^{-1}, \quad \text{where } Q = \sum_{i=1}^{k} C_i^T B_i^{-1} C_i. \tag{11} $$

A preconditioner for (8) can be obtained by truncating the above power series at
the $h$-th term. Clearly, the higher $h$, the better the preconditioning will be, and
the fewer PCG iterations will be required. However, preconditioning one vector
requires solving $k \times h$ linear systems involving the matrices $B_i$, thereby increasing
the cost of each PCG iteration. The best trade-off between the reduction of the
iteration count and the cost of each iteration is $h = 0$, corresponding to the
diagonal preconditioner $D^{-1}$ [5].
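A matrix-free PCG of this kind can be sketched as follows. This is an illustrative sketch under assumptions, not the IPM implementation: the instance is random, $D$ is assumed diagonal with entries $\Theta^0 + \sum_i \Theta^i$, all function names are hypothetical, and `np.linalg.solve` stands in for the reuse of Cholesky factors of the $B_i$ that a real code would perform. The preconditioner applies the power series of (11) truncated after the term of index `h`, so `h = 0` gives $D^{-1}$.

```python
import numpy as np


def make_instance(seed=0, m=4, n=7, k=3):
    """Random toy data: B_i = E Theta^i E^T, C_i = E Theta^i, diagonal of D."""
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((m, n))               # assumption: stand-in for E
    thetas = [rng.uniform(0.5, 2.0, n) for _ in range(k + 1)]  # index 0: slacks
    B = [(E * t) @ E.T for t in thetas[1:]]
    C = [E * t for t in thetas[1:]]
    d = sum(thetas)                               # assumption: diag of D
    return B, C, d


def apply_Q(v, B, C):
    """Q v = sum_i C_i^T B_i^{-1} C_i v, one solve with each B_i."""
    return sum(C_i.T @ np.linalg.solve(B_i, C_i @ v) for B_i, C_i in zip(B, C))


def apply_H(v, B, C, d):
    """H v = D v - Q v, without ever forming H."""
    return d * v - apply_Q(v, B, C)


def apply_precond(r, B, C, d, h=0):
    """Apply sum_{i=0}^{h} (D^{-1} Q)^i D^{-1} to r (h = 0 is D^{-1})."""
    term = r / d
    total = term.copy()
    for _ in range(h):                            # each pass costs k solves
        term = apply_Q(term, B, C) / d
        total += term
    return total


def pcg(apply_A, apply_Minv, b, tol=1e-10, max_iter=500):
    """Standard preconditioned conjugate gradient for an SPD operator."""
    x = np.zeros_like(b)
    r = b.copy()
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, max_iter + 1):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter
```

For example, `pcg(lambda v: apply_H(v, B, C, d), lambda r: apply_precond(r, B, C, d, h=0), rhs)` solves a system with the toy $H$ using the diagonal preconditioner; raising `h` trades extra solves per iteration for a better preconditioner, as described above.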

The IPM code, implementing this algorithm, has been shown to be competitive
with a number of other sequential approaches [5]. It is written mainly in C,
with only the Cholesky factorization routines (devised by E. Ng and B. Peyton
[20]) coded in Fortran. Both the sequential and parallel versions can be freely
obtained for academic purposes from

http://www-eio.upc.es/~jcastro/software.html.



FEUP - Faculdade de Engenharia da Universidade do Porto


6  J.  Castro  and  A.  Frangioni 

4  Parallelization  of  the  Algorithm 

The solution of (6) is by far the most expensive procedure in the interior-point
algorithm, consuming up to 97% of the total execution time for large problems.
With the above approach, this can be accomplished by means of the following
steps:

- Factorization of the $k$ matrices $B_i$: note that the current implementation
uses sequential Cholesky solvers, but parallel Cholesky solvers could be used
for increasing the degree of parallelism of the approach.

- Computation of $\beta^0 = \bar b^0 - \sum_{i=1}^{k} C_i^T B_i^{-1} \bar b^i$, which requires $k$ backsolves on
the factorizations of $B_i$ and matrix-vector products of the form $C_i^T v^i$.

- For each iteration of the PCG, computation of $(D - \sum_{i=1}^{k} C_i^T B_i^{-1} C_i) v$, which
requires backsolves on the factorizations of $B_i$ and matrix-vector products
of the form $C_i v^i$ and $C_i^T v^i$.

- Computation of $\beta^i = \bar b^i - C_i \Delta y^0$, which requires matrix-vector products of
the form $C_i v^i$.

- Solution of the systems $B_i \Delta y^i = \beta^i$.

Hence, most of the parallelization effort boils down to performing in parallel
the factorization of the $B_i$'s, backward and forward substitution with these
factorizations and matrix-vector products involving $C_i$ or $C_i^T$. Thus, there is no
need for sophisticated implementations of parallel linear algebra routines. Note
that higher-order preconditioners ($h > 0$) would somewhat complicate the above
scheme, but the basic blocks would remain the same.
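The per-commodity structure of the product $(D - \sum_i C_i^T B_i^{-1} C_i) v$ can be sketched with a thread pool standing in for the processors. This is only an illustration under assumptions: the paper's code uses SGI compiler directives in C, not Python threads; the round-robin assignment, the random data, the diagonal form of $D$ and all names below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def make_instance(seed=0, m=4, n=7, k=4):
    """Random toy data: B_i = E Theta^i E^T, C_i = E Theta^i, diagonal of D."""
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((m, n))               # assumption: stand-in for E
    thetas = [rng.uniform(0.5, 2.0, n) for _ in range(k + 1)]
    B = [(E * t) @ E.T for t in thetas[1:]]
    C = [E * t for t in thetas[1:]]
    d = sum(thetas)                               # assumption: diag of D
    return B, C, d


def partial_Qv(v, B, C, commodities):
    """Work of one processor: sum of C_i^T B_i^{-1} C_i v over its commodities."""
    out = np.zeros_like(v)
    for i in commodities:
        out += C[i].T @ np.linalg.solve(B[i], C[i] @ v)
    return out


def parallel_Hv(v, B, C, d, p):
    """Compute H v with p workers: share v, then sum the partial results."""
    k = len(B)
    owned = [range(w, k, p) for w in range(p)]    # round-robin commodity assignment
    with ThreadPoolExecutor(max_workers=p) as pool:
        # every worker reads the same v (the "broadcast" step)
        partials = list(pool.map(lambda own: partial_Qv(v, B, C, own), owned))
    return d * v - sum(partials)                  # the "vector-reduce" step


def serial_Hv(v, B, C, d):
    """Reference: the same product computed by a single worker."""
    return d * v - partial_Qv(v, B, C, range(len(B)))
```

With `k` divisible by `p`, each worker handles the same number of commodities, which is the balanced situation the text relies on.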

Although the above procedures are by far the most important, a number
of other minor steps can be easily parallelized, such as the computation of the
other primal and dual directions ($\Delta x^i$, $\Delta z^i$, $\Delta w^i$), the computation of the primal
and dual step lengths $\alpha_P$ and $\alpha_D$, the updating of the current primal and dual
solution, the computation of the primal and dual objective function values and so
on. It is easy to see that all the data concerning one given commodity $i$ ($x^i$, $c^i$, $u^i$,
$y^i$, $w^i$, ...) can be stored in the local memory of the one processor that is in charge
of that commodity, and it is never required by other processors. This ensures a
good "locality" of data, and a low need for inter-processor communication. It
should also be noted that the number of operations required for each commodity
is the same, which guarantees the load balancing between processors, at least as
long as the number of commodities assigned to each processor is the same.


4.1  Parallel  Programming  Environment 

The parallel version of the IPM code, pIPM, has been developed on the Silicon
Graphics Origin2000 (SGI O2000) server located at the European Center
for Parallelism of Barcelona (CEPBA), running an IRIX64 6.5 Unix operating
system. Like most current parallel architectures, the SGI O2000 offers
both message-passing and shared-memory programming paradigms, although
the main memory is physically distributed among the processors. The server has
64 MIPS R10000 processors running at 250 MHz, each with 32+32 KB of L1
cache and 4 MB of L2 cache, and rated at 14.7 SPECint95 and 24.5 SPECfp95.
A total of 8 GB of memory is distributed among these processing elements. This
computer appeared at position 275 of the TOP500 November 1998 supercomputer
sites list [10].

The default programming style supported by the SGI O2000 is a custom
shared-memory version of C [24], with parallel constructs specified by means of
compiler directives (#pragmas). Placement of the memory on the processors and
communication is hidden from the programmer and automatically performed by
the system. The main advantage of this choice is ease of portability: existing
codes can be parallelized with limited effort. It is even possible to avoid
maintaining two different versions (sequential and parallel) of the same code,
which is important to optimize the development efforts.

However, this programming style also has a number of drawbacks, mainly
limited control over memory ownership and limited support for vector-broadcast
and vector-reduce operations. Placement of the data structures in the local memory
of the processors can be only partly (and indirectly) influenced by the programmer.
Also, the granularity of memory placement is that of the virtual memory
pages (16K) rather than that of the individual data structures. All this can
result in cache misses and page faults from the local memory of each processor,
decreasing the performance of the parallel codes. Although advanced directives
allow more detailed control over these features, their use requires
a more extensive rewriting of the code, thus losing part of the benefits in
terms of portability and ease of maintenance. Because of that, the computational
results presented in Section 5 were obtained with the default data distribution
provided by the system (the same used in [2]). However, the assignment of commodities
to processors was optimized for this distribution, hopefully limiting the
possible negative effects. The limited support for broadcast/reduce operations
is understandable in a shared-memory oriented language; however, it may result
in poorer performance for codes, like pIPM, where these operations account for
almost all of the communication time.

5  Computational  Results 

5.1  The  Instances 

Three sets of multicommodity instances were used for the computational experiments.
The first is made up of 18 problems obtained with an improved version of
Ali and Kennington's Mnetgen generator [11]. These instances are very large (up
to about 2.5 million variables and 260,000 constraints), with the number of
commodities varying from very few (8) to quite many (512). This is useful
for characterizing the trends in the performance of the code as the number of
commodities varies [7,11].

The second set consists of ten of the PDS (Patient Distribution System)
problems. These problems arise from a logistic model for evacuating patients
from a place of military conflict. The different instances arise from the same
basic scenario by varying the time horizon, i.e., the number of days covered by
the model. The PDS problems have been considered, until recently, essentially
impossible to solve with a high degree of accuracy. Although this has changed,
they are still quite challenging multicommodity instances.

The third set of problems is made of the four Tripart problems and of the
Gridgen1 problem. These instances were obtained with the Tripartite generator
and with a variation for multicommodity flows of the Gridgen generator [4].
These are very difficult multicommodity flow instances, as shown in Section 5.3.

The dimensions of each problem are reported in Tables 1, 2 and 3. Columns
"m", "n", and "k" show the number of nodes, arcs, and commodities. Columns
"ñ" and "m̃" give the number of variables and constraints of the linear problem.
All the instances can be downloaded from

http://www.di.unipi.it/di/groups/optimize/Data.


5.2  Performance  Measures 

The following well-known performance measures [3] will be considered for assessing
the performance of pIPM. Denoting by $T_p$ the execution time obtained with
$p$ processors, the speedup $S_p$ with $p$ processors can be defined as $S_p = T_1/T_p$.
The fraction of the sequential execution time consumed in the parallel region of
the code will be denoted by $f$; values of $f$ close to 1 are necessary in order to
obtain good speedups, as demonstrated by Amdahl's law

$$ S_p \le \bar S_p = \frac{1}{f/p + (1 - f)} \le \frac{1}{1 - f}. $$

The efficiency with $p$ processors is

$$ E_p = \frac{S_p}{p}. $$

$E_p$ represents the fraction of the time that a particular processor (of the $p$
available) is usefully employed during the execution of the algorithm. $\bar S_p$ and
$\bar E_p = \bar S_p / p$ are respectively the ideal speedup and efficiency, the maximum ones that can be
obtained due to the inherent serial bottlenecks in the algorithm.

Another interesting performance measure is the absolute speedup, obtained
by replacing $T_1$ with the execution time of the best serial algorithm known. This
is usually difficult to obtain, and it will be discussed separately.
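These measures are simple to evaluate. The sketch below (function names are illustrative, not part of pIPM) computes the ideal values from Amdahl's law and the observed values from measured times, taking the efficiency as speedup divided by $p$.

```python
def ideal_speedup(f, p):
    """Amdahl bound: 1 / (f/p + (1 - f)), with f the parallel fraction in [0, 1]."""
    return 1.0 / (f / p + (1.0 - f))


def ideal_efficiency(f, p):
    """Ideal efficiency, i.e. ideal speedup divided by the processor count."""
    return ideal_speedup(f, p) / p


def observed(t1, tp, p):
    """Observed speedup T_1 / T_p and efficiency S_p / p from wall-clock times."""
    speedup = t1 / tp
    return speedup, speedup / p
```

For instance, with the values reported for the 512-64 Mnetgen problem ($f = 0.992$, $T_1 = 9244.5$, $T_8 = 1271.5$), `ideal_speedup(0.992, 8)` is about 7.6 and `observed(9244.5, 1271.5, 8)` gives a speedup of about 7.3 and an efficiency of about 0.9, in line with the tabulated figures. Since $f$ is reported rounded, small deviations from tabulated ideal values are to be expected.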


5.3  The  Results 

Tables 1, 2 and 3 show the computational results obtained. Columns "IP" and
"PCG" report the total number of interior-point and PCG iterations, respectively.
Column "f" gives the fraction of the total sequential time consumed in
the parallel region of the code. Column "p" gives the number of processors used
Table 1. Dimensions and results for the Mnetgen problems.

Instance      m      n    k        ñ       m̃  f(%)   IP    PCG   p        Tp    Sp    S̄p   Ep   Ēp
                                                                 1    2721.4   1.0   1.0  1.0  1.0
                                                                 8     454.7   6.0   7.4  0.7  0.9
                                                                16     299.3   9.1  13.6  0.6  0.8
                                                                32     289.3   9.4  23.3  0.3  0.7
512-64      512   4768   64   309920   37536  99.2   95  27004   1    9244.5   1.0   1.0  1.0  1.0
                                                                 8    1271.5   7.3   7.6  0.9  0.9
                                                                16     702.8  13.2  14.3  0.8  0.9
                                                                32     507.9  18.2  25.6  0.6  0.8
                                                                64     563.8  16.4  42.6  0.3  0.7
512-128     512   4786  128   617394   70322  99.3  112  28631   1   19385.9   1.0   1.0  1.0  1.0
                                                                 8    3237.0   6.0   7.6  0.7  1.0
                                                                16    1780.6  10.9  14.5  0.7  0.9
                                                                32    1271.5  15.2  26.3  0.5  0.8
                                                                64     848.5  22.8  44.4  0.4  0.7
512-256     512   4810  256  1236170  135882  99.5  130  32676   1   43251.2   1.0   1.0  1.0  1.0
                                                                 8    7401.6   5.8   7.7  0.7  1.0
                                                                16    5306.7   8.2  14.9  0.5  0.9
                                                                32    2783.7  15.5  27.7  0.5  0.9
                                                                64    2205.9  19.6  48.7  0.3  0.8
512-512     512   4786  512  2455218  266930  99.6  194  48229   1  135753.7   1.0   1.0  1.0  1.0
                                                                 8   25257.7   5.4   7.8  0.7  1.0
                                                                16   14198.4   9.6  15.1  0.6  0.9
                                                                32    8325.3  16.3  28.5  0.5  0.9
                                                                64    5226.0  26.0  51.1  0.4  0.8
Table 2. Dimensions and results for the PDS problems.

Instance      m      n    k        ñ       m̃  f(%)   IP    PCG   p        Tp    Sp    S̄p   Ep   Ēp
PDS1        126    372   11     4464    1758  83.3   30    169   1       0.7   1.0   1.0  1.0  1.0
                                                                 6       0.5   1.3   3.3  0.2  0.5
                                                                11       0.7   0.9   4.1  0.1  0.4
PDS10      1399   4792   11    57504   20181  94.7   66   1107   1      44.8   1.0   1.0  1.0  1.0
                                                                 6      25.3   1.8   4.7  0.3  0.8
                                                                11      24.6   1.8   7.2  0.2  0.7
PDS20      2857  10858   11   130296   42285  96.6   69   1911   1     254.1   1.0   1.0  1.0  1.0
                                                                 6      70.9   3.6   5.1  0.6  0.9
                                                                11      62.6   4.1   8.2  0.4  0.7
PDS30      4223  16148   11   193776   62601  97.9   92   3835   1     777.1   1.0   1.0  1.0  1.0
                                                                 6     206.4   3.8   5.4  0.6  0.9
                                                                11     189.2   4.1   9.1  0.4  0.8
PDS40      5652  22059   11   264708   84231  97.9   73   1872   1    1288.1   1.0   1.0  1.0  1.0
                                                                 6     258.4   5.0   5.4  0.8  0.9
                                                                11     194.1   6.6   9.1  0.6  0.8
PDS50      7031  27668   11   332016  105009  98.8  100   4711   1    3486.4   1.0   1.0  1.0  1.0
                                                                 6     727.3   4.8   5.7  0.8  0.9
                                                                11     530.1   6.6   9.8  0.6  0.9
PDS60      8423  33388   11   400656  126041  99.0  106   5215   1    6262.0   1.0   1.0  1.0  1.0
                                                                 6    1252.4   5.0   5.7  0.8  1.0
                                                                11     745.4   8.4  10.0  0.8  0.9
PDS70      9750  38396   11   460752  145646  99.2  116   7015   1   10873.8   1.0   1.0  1.0  1.0
                                                                 6    2112.2   5.1   5.8  0.9  1.0
                                                                11    1268.5   8.6  10.2  0.8  0.9
PDS80     10989  42472   11   509664  163351  99.2  107   3768   1    8855.0   1.0   1.0  1.0  1.0
                                                                 6    1726.3   5.1   5.8  0.9  1.0
                                                                11    1093.8   8.1  10.2  0.7  0.9
PDS90     12186  46161   11   553932  180207  99.4  135   9357   1   20784.3   1.0   1.0  1.0  1.0
                                                                 6    3950.5   5.3   5.8  0.9  1.0
                                                                11    2447.8   8.5  10.4  0.8  0.9


Table 3. Dimensions and results for the Tripart and Gridgen problems.

Instance      m      n    k        ñ       m̃  f(%)   IP    PCG   p        Tp    Sp    S̄p   Ep   Ēp
Tripart1    192   2096   16    35632    5168  93.6   65   3733   1      34.9   1.0   1.0  1.0  1.0
                                                                 4      21.3   1.6   3.4  0.4  0.8
                                                                 8      17.9   1.9   5.5  0.2  0.7
                                                                16      19.6   1.8   8.2  0.1  0.5
Tripart2    768   8432   16   143344   20720  91.8   63   2652   1     156.6   1.0   1.0  1.0  1.0
                                                                 4      71.6   2.2   3.2  0.5  0.8
                                                                 8      55.4   2.8   5.1  0.4  0.6
                                                                16      60.3   2.6   7.2  0.2  0.4
Tripart3   1200  16380   20   343980   40380  94.9   84   9343   1    1140.7   1.0   1.0  1.0  1.0
                                                                 4     408.4   2.8   3.5  0.7  0.9
                                                                10     300.5   3.8   6.9  0.4  0.7
                                                                20     304.8   3.7  10.2  0.2  0.5
Tripart4   1050  24815   35   893340   61565  95.6   96   8498   1    3273.2   1.0   1.0  1.0  1.0
                                                                 5     893.7   3.7   4.3  0.7  0.9
                                                                 7     721.5   4.5   5.5  0.6  0.8
                                                                35     601.1   5.4  14.0  0.2  0.4
Gridgen1   1025   3072  320   986112  331072  99.5  173  49981   1   37234.9   1.0   1.0  1.0  1.0
                                                                 8   10533.2   3.5   7.7  0.4  1.0
                                                                16    7678.7   4.8  14.9  0.3  0.9
                                                                32    4426.5   8.4  27.7  0.3  0.9
                                                                64    3248.6  11.5  48.7  0.2  0.8

in the execution. "Tp" denotes the execution (wall-clock) time, excluding initializations.
Columns "Sp" and "Ep" give respectively the observed speedups and
efficiencies, while columns "S̄p" and "Ēp" report their ideal values.

Analyzing the results, the following trends emerge:

- $f$ is always fairly large, and increases with the problem size; the largest
problems attain very high ideal efficiencies. This indicates that the approach
has a good potential for scalability, at least in theory, for very large scale
problems.

- For fixed $p$ and $k$, $E_p$ almost always increases with the size of the underlying
network, in all three groups of instances. This is reasonable: the computational
burden of the PCG iteration grows quadratically with the number
of nodes, while the communication cost grows only linearly. This seems to
indicate that the approach is especially suited for problems where the size
of the network is large w.r.t. the number of commodities. Remarkably, IPM
has been shown to be particularly efficient, at least w.r.t. decomposition
approaches, exactly for this kind of instance [11].

- Keeping $p$ and the size of the network fixed, $E_p$ initially increases with $k$;
however, for "large" values of $k$, $E_p$ stalls, and may even decrease. This phenomenon,
clearly visible in the Mnetgen results, is difficult to explain. For
fixed $p$, increasing $k$ can, in theory, only increase the fraction of time that is
spent in the parallel part of the algorithm, while the sequential bottleneck
and the communication requirements should remain the same. Indeed, $\bar E_p$
is monotonically nondecreasing with $k$. This decrease in efficiency is most
likely an effect of the page-based memory placement, which may cause data
logically pertaining to one processor to be physically located on another.

- For any fixed instance, $E_p$ obviously decreases as $p$ increases; unfortunately,
the decrease is much faster than that predicted by $\bar E_p$, so that the gap
between $E_p$ and $\bar E_p$ increases with $p$. However, for fixed $p$ the gap decreases
when the size of the network increases, and a similar (although less clear)
trend seems to exist w.r.t. $k$. Thus, whatever mechanism is responsible for
this discrepancy between $E_p$ and $\bar E_p$, its effects seem to lessen as the instances
grow larger.

Since, except for PDS problems with $p = 6$, each processor is assigned the
same number of commodities, there can be no load imbalance between the processors.
Thus, the gap between $E_p$ and $\bar E_p$ can only be explained as being due
to communication time. Indeed, pIPM requires more communication than most
other parallel codes for multicommodity flows. Most of the communication occurs
during the computation of $(D - \sum_{i=1}^{k} C_i^T B_i^{-1} C_i) v$, where $v$ is the current estimate
of the solution of (8), at each PCG iteration. This requires first the broadcast
of $v$ from the "master" processor (the one executing the serial-only part
of the code) to all the other processors, followed by a vector-reduce operation
to accumulate all the partial results $C_i^T B_i^{-1} C_i v$ back to the "master" processor.
The amount of communication is essentially the same as in the decomposition
approaches [7,12,21], and substantially lower than that of the other specialized
parallel interior-point codes [16,9], which need to share the (dense) matrices
$C_i^T B_i^{-1} C_i$ in order to form the Schur complement $H$. However, in pIPM communication
occurs at every PCG iteration, i.e., much more often than in decomposition
codes. The other specialized parallel interior-point codes have a much
smaller number of communication "rounds", one for each interior-point iteration,
although each round is more expensive.
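The difference in the number of communication rounds can be made concrete with a small calculation. The sketch below only restates the counting described above (one broadcast and one vector-reduce per PCG iteration in pIPM, one round per interior-point iteration in codes that form $H$); the function name and the returned keys are illustrative, and the cost of an individual round is not modelled.

```python
def communication_rounds(ip_iterations, pcg_iterations):
    """Count communication operations under the simple model described above."""
    return {
        # one broadcast of v plus one vector-reduce per PCG iteration
        "pcg_based_collectives": 2 * pcg_iterations,
        # one (more expensive) round per interior-point iteration
        "schur_based_rounds": ip_iterations,
        # average number of PCG iterations per interior-point iteration
        "pcg_per_ip": pcg_iterations / ip_iterations,
    }
```

With the iteration counts reported for the 512-512 Mnetgen problem (194 interior-point and 48229 PCG iterations), this gives 96458 collective operations, i.e. roughly 249 PCG iterations per interior-point iteration.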

Thus, pIPM may be inherently more vulnerable to slowdowns induced by
communication costs. Indeed, the efficiency of pIPM seems to be, on average,
somewhat worse than that of the approach in [16], even though direct comparison
is difficult due to the different sets of test problems. The instances used in [16]
are much smaller, and the cost of forming and factorizing $H$ grows rapidly with
the size of the problem.

Furthermore, the current implementation of pIPM, using the parallel constructs
available in the SGI O2000 C compiler [24], is not aggressively optimized,
particularly in the two critical operations, i.e., broadcasts and vector-reduces.
Both are currently obtained by means of read/write operations to shared vectors,
which are presumably less efficient than the typical system-provided implementation
which exploits information about the topology of the interconnection
network and the available communication hardware. Also, a part of the communication
overhead could be due to a non-optimal placement of the data structures
in the local memory of the processors, especially at the boundaries of the virtual
memory pages. Thus, we believe that there is still room for (potentially large)
reductions of the gap between the observed and the theoretical speedup/efficiency
of the code. However, pIPM already attains quite satisfactory efficiencies in some
instances, most notably the largest PDS problems.

Table 4. Comparing Cplex 6.5 and IPM on the Tripart and Gridgen problems.

Problem       IPM    Cplex 6.5
Tripart1       40           74
Tripart2      249          627
Tripart3     1584         2851
Tripart4     4983        33235
Gridgen1   126008     > 2.8e+6

As far as the absolute speedup is concerned, IPM is known not to be the
fastest sequential code for some of the test instances. In [11], a bundle-based
decomposition approach has been shown to outperform IPM on the Mnetgen
instances, while IPM was competitive on the PDS problems. Furthermore, recent
developments in the field of simplex methods [18] have led to impressive
performance improvements for these algorithms on multicommodity flow problems.
Nowadays, even the largest PDS problems can be solved in less than an
hour of CPU time with the state-of-the-art simplex code Cplex 6.5. However, the
simplex method is not easily parallelized. Furthermore, other multicommodity
problems, like the Tripart and the Gridgen instances, are much more difficult to solve:
ε-approximation algorithms can approximately solve them in a relatively short
time [4], but only if the required accuracy is not high. On these instances, the
interior-point algorithm in Cplex 6.5 is far more efficient than the dual simplex,
but it is in turn largely outperformed by IPM, as shown in Table 4. Columns
"IPM" and "Cplex 6.5" report the running time required for the solution
of the problem by IPM and Cplex 6.5, respectively, on a Sun Ultra2 2200/200
workstation (rated at 7.8 SPECint95 and 14.7 SPECfp95) with 1 GB of main
memory. Thus, for the largest and most difficult instances of the set, pIPM
provides a competitive approach.

6  Conclusions  and  Future  Research 

The parallel code pIPM presented in this work can be an efficient tool for the
solution of certain types of large and difficult multicommodity problems. Quite
good speedups are achieved in some instances, such as the large PDS problems. In
other cases, a gap between the ideal efficiency and the observed one exists. However,
we are confident that a more efficient implementation of reduce/broadcast
operations and a better placement of data structures (which could mean using
MPI or PVM as parallel environments) can make pIPM even more competitive
on a wider range of multicommodity instances.
Abstract. Simulation of the dynamic behaviour of liquid-liquid systems is of
prominent importance in many industrial fields. Algorithms for fast and reliable
simulation of single stirred vessels and extraction columns have already been
published by some of the present authors. In this work, we propose a
methodology to develop a parallel version of a previously validated sequential
algorithm, for the simulation of a liquid-liquid Kühni column. We also discuss
the algorithm implementation in a distributed memory parallel-computing
environment, using MPI. Despite the difficulties encountered to preserve
efficiency in the case of a heterogeneous cluster, the results demonstrate
performance improvements that clearly indicate that the approach followed may
be successfully extended to allow real-time plant control applications.

Key words: Distributed Memory Parallel Systems; MPI; Simulation of Liquid-Liquid
Systems.


1.  Introduction 


The  mass  transfer  efficiency  of  liquid-liquid  agitated  systems  is  highly  dependent 
on  the  hydrodynamics  of  the  dispersed  phase,  namely  of  the  drop  break-up  and 
coalescence  frequencies  that  result  from  the  turbulence  induced  by  agitation.  In 
reacting  systems,  this  behaviour  is  also  of  fundamental  importance  to  the  overall  rate 
and  selectivity  of  the  process.  A  comprehensive  and  synthetic  discussion  about  the 
behaviour  of  liquid-liquid  systems  is  found  in  Ramkrishna’s  work  [1]. 

Knowledge  of  the  dynamic  behaviour  of  liquid-liquid  systems  is  still  limited,  in 
particular  when  it  comes  to  its  implementation  as  physically  accurate,  fast  and  reliable 
algorithms,  with  effective  predictive  power  and  suitable  for  real-time  plant  control 
applications  [2].  Potential  fields  of  practical  use  of  this  knowledge  base  encompass 
very  broad  segments  of  chemical  technology,  including  the  recovery  of  important 
non-renewable  resources  or  the  removal  of  dangerous  substances. 

Ribeiro L. M. [3] and Ribeiro L. M. et al. [4] published innovative algorithms for
directly (numerically) solving the population balance equation for the simulation of
the full trivariate (drop volume, v, solute concentration, c, and age, τ) unsteady-state
behaviour of interacting liquid-liquid dispersions, in single continuous (or batch)
stirred vessels. Not only was the start-up period towards the steady state simulated, but
also the system's response to disturbances in the main operating variables (mean
residence time, dispersed phase hold-up, agitation power input density, feed drop
volume distribution and dispersed and continuous phase solute concentrations). The
methodology used was later applied to a simplified version of the algorithm, which
calculates the drop size distribution and the mean and standard deviation of solute
concentration within each volume class [5]. This methodology was further extended
to simulate the behaviour of a liquid-liquid extraction column [6].

The  aim  of  this  paper  is  to  show  that,  using  low  cost  high  performance  computing 
environments  and  the  above  referred  methodology,  it  is  possible  to  simulate  in  detail 
the  dynamics  of  stirred  liquid-liquid  extraction  columns,  with  execution  times  suitable 
for  prediction  of  the  behaviour  of  these  systems  and  for  control  purposes. 


2.  The  sequential  algorithm 


Following the experimental work carried out by Gomes [7] in a Kühni pilot plant
column of the Technical University of Munich, a sequential algorithm was developed
to trace its dynamics [6]. This column has an internal diameter of 150 mm and 36 stages,
each 70 mm high.

A Kühni column may be adequately described as a sequence of agitated vessels
with back mixing and forward mixing effects on the movement of the dispersed phase
along the column. The hydrodynamic phenomena of break-up and coalescence of the
individual drops of the dispersed phase were modeled using the population balance
formulation of Coulaloglou and Tavlarides [8].

Besides the interaction phenomena, the transport of the drops from one stage to the
next must also be modeled. The transport model used was based on the one described
by Cruz-Pinto [9], taking into account the constriction factor calculated by Goldman
[10] and the dispersion equation developed by Regueiras [11]. The mathematical
model equations used are presented elsewhere [11].

From the mathematical model, the drop birth and death rates due to break-up,
coalescence and drop movement along the column are calculated. Representing by
B(n,t) and D(n,t) these source and sink terms, at time t and location [n, n+dn] of
the drop phase space, the dynamics of the drop number density function X(n,t) is
described by:

$$ \frac{\partial}{\partial t} X(n,t) + \frac{\partial}{\partial n} \left[ \frac{dn}{dt}\, X(n,t) \right] = B(n,t) - D(n,t) \tag{1} $$



To numerically solve the above population balance equation, a phase space-time
discretization is used and drops are assumed to reside on cell sites. Drops move from
cell to cell in the discretized phase-space at each time step. The numerical integration
scheme involves the explicit calculation of time derivatives, with a first-order
backward finite-difference method [4].
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The  sequential  algorithm  developed  for  the  counter-current  Kiihni  column 
simulation  is  able  to  predict  the  local  drop  size  distributions  and  the  local  hold-up 
profiles  of  the  column.  The  algorithm  was  implemented  in  C++  and  the  corresponding 
program  is  presently  available  for  Windows  9x  and  Windows  NT  environments  [6]. 

The  program  consists  of  two  parts:  the  initialization  of  the  system  and  the  column 
simulation.  The  corresponding  flowchart  is  presented  in  Fig.  1. 

The main program starts by reading all the data needed to perform the simulation.
These data include: the physical characteristics of the column, such as the number of
stages, the stirrer diameter, and the height and diameter of each stage; the drop
breakage, coalescence and transport model parameters; the physical properties of both
phases, such as density, viscosity and interfacial tension; the operating conditions of
the column, namely the flow rates of each phase and the stirrer rotational speed; the
total simulated time, tmax; and the time interval, Δt, at which the program writes the
values of the column and system state variables to a file.

At time t = 0, the column variables are initialized to a standard initial state,
corresponding to a column filled with continuous phase and no dispersed phase.

The program then enters a loop in which it writes the values of the column
variables to a file, tests whether the time has reached the total simulated time and,
if not, calls the TimeStep routine to calculate the column status at time t + Δt. It then
updates the value of t and returns to the beginning of the loop. When tmax is reached,
the program exits the loop, writes the global results to a file and terminates execution.
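The loop just described can be outlined in a few lines. The following Python sketch is only illustrative and is not the authors' C++ code: the callables `time_step`, `write_data` and `write_final_data` are hypothetical stand-ins for the TimeStep, WriteData and WriteFinalData routines, and counting whole output intervals instead of accumulating t is an assumption made here to avoid floating-point drift.

```python
def run_simulation(state, t_max, delta_t, time_step, write_data, write_final_data):
    """Write the state, test for termination, then advance by delta_t."""
    n_intervals = round(t_max / delta_t)  # assumes t_max is a multiple of delta_t
    t = 0.0
    for k in range(n_intervals + 1):
        write_data(t, state)              # output at t = 0, delta_t, 2*delta_t, ...
        if k == n_intervals:              # total simulated time reached
            break
        state = time_step(state, t, delta_t)
        t = (k + 1) * delta_t             # update t for the next pass
    write_final_data(state)
    return state
```

Passing the routines as arguments keeps the sketch self-contained; in the program described here they would simply be called directly.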

The routine TimeStep executes the simulation of the column for a period of time,
Δt, between two consecutive WriteData calls. To accomplish this, the routine calls
the dXdt routine for each column stage and, based on the death frequencies obtained,
calculates a suitable step value for the integration. This value, dt, is then used to
calculate the new values of the variables describing the state of the column. When the
accumulated time reaches Δt, the routine exits, returning control to the main
program loop.
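The text does not state the rule used to derive dt from the death frequencies. A minimal sketch, assuming a common stability criterion (no cell may lose more than a fixed fraction of its drops in one sub-step) and a plain explicit Euler update, might look as follows. The `dxdt` callable, the `safety` factor and the list-based state are hypothetical and serve only to illustrate the sub-stepping idea.

```python
def time_step(x, delta_t, dxdt, safety=0.5):
    """Advance the per-cell drop numbers x by delta_t using explicit sub-steps."""
    remaining = delta_t
    while remaining > 0.0:
        # dxdt returns, per cell, a birth rate and a death frequency (1/s).
        birth, death_freq = dxdt(x)
        max_freq = max(death_freq)
        # Assumed criterion: dt * max_freq <= safety, and never overshoot delta_t.
        dt = remaining if max_freq <= 0.0 else min(remaining, safety / max_freq)
        x = [xi + dt * (b - f * xi) for xi, b, f in zip(x, birth, death_freq)]
        remaining -= dt
    return x
```

Tracking the remaining time rather than the accumulated time guarantees that the last sub-step lands exactly on the output instant.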

The routine dXdt calculates the drop birth and death frequencies inside a single
column stage, as well as the number of drops per unit time exchanged with the
contiguous stage. It also calculates the continuous phase flow rate between the same
two stages. To perform these calculations, this routine needs the values of the state
variables at both stages. Only the auxiliary variables of the current stage are modified.

The hierarchy of the called routines and the routine tasks are outlined in Fig. 2 and
Table 1, respectively.

The routine LLExtrColumns corresponds to the ‘Initialization of the Column’
box and to the ‘Initialization of the variables’ box. The TimeStep and dXdt routines are
designated on the flowchart by their own names.


Fig.  2.  The  hierarchy  of  the  called  routines 


Routine            Task
ClearDerivatives   Prepares the variables for the calculations in dXdt.
dXdt               Calculates the drop birth and death frequencies of one stage.
LLColMain          Main part of the program; calls the routines.
LLExtrColumns      Prepares each stage for the beginning of the simulation and calculates the inlet drop distributions.
TimeStep           Executes the simulation for a given period of time.
WriteData          Outputs to a file the results at the end of each time-step.
WriteFinalData     Outputs to a file the final results.

Table 1. Routine tasks


We  have  already  shown  that  the  results  obtained  with  the  sequential  program  for 
the  hold-ups  and  the  drop  size  distributions  at  different  stages  of  the  column  are  in 
good  agreement  with  the  experimental  data,  for  several  operating  conditions  of  the 
column  [7]. 

So far, the program does not include mass transfer calculations. With mass transfer,
it is generally necessary to solve the population balance equation (1) in a
three-dimensional phase-space. In the present case, using a monovariate drop property
(volume) distribution with a drop volume discretization of 20 classes, the execution
time on a 120 MHz Pentium for one second of simulated time was four times longer
than the real process. Although already fast in comparison with other resolution
approaches [2], this algorithm needs to be further accelerated in order to be
suitable for future control applications to liquid-liquid extraction columns under
mass transfer conditions. The introduction of excessive algorithm simplifications,
other than those of the underlying mathematical model, is not desirable, as they would
hide most of the information on the temporal behaviour of the dispersed phase
property distributions. This need to speed up the calculations was the motivation for
the development of a parallel version of the sequential program. This parallel version,
implemented for a distributed memory parallel computing environment, is currently
the only published promising approach to the future realistic simulation of various
contactors, including extraction columns, and their control.


3.  The  parallelization  approach 


3.1  Initial  considerations 

A  sequential  C  code  was  written  for  the  algorithm  to  ensure  that  the  calculations  in 
each  time  step  only  need  the  results  from  the  previous  iteration. 

The  analysis  of  the  logical  units  of  this  sequential  code  pointed  out  the 
methodology  used  to  develop  a  parallel  version  of  the  algorithm.  Table  2  clearly 
shows  that  the  most  time  consuming  routine  is  the  one  responsible  for  calculating  the 
drop  birth  and  death  frequencies  (due  to  drop  breakage,  coalescence,  and  transport)  in 
each  time  step  and  for  each  column  stage  (dXdt  routine).  The  time  taken  by  the 
execution  of  the  other  routines  is  relatively  insignificant  and  is  not  shown  in  Table  2. 
The parallel version of the algorithm is thus based on partitioning the calculation
of these frequencies, for each time step, among the available processors.
Synchronization takes place at the end of each iteration.


Name        Time (%)   Secs   Calls   Per call (ms/call)   Total (ms/call)
dXdt          87.40    2.29    5040          0.45                0.49
TimeStep       4.20    0.11      20          5.50              131.00

Table 2. The most time consuming routines
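The profile in Table 2 gives a rough idea of how much speedup can be expected when only the dXdt calls are distributed. The sketch below applies Amdahl's law to the 87.40% share of dXdt. It is an illustration added here, not a calculation from the paper, and it assumes that everything outside dXdt stays serial and that communication is free. The parallel fraction presumably changes with the number of drop classes, so the numbers are only indicative for the profiled run.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Ideal speedup when only `parallel_fraction` of the run time is parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


DXDT_FRACTION = 0.874  # share of run time spent in dXdt (Table 2)

if __name__ == "__main__":
    for p in (2, 4, 6, 18):
        print(p, round(amdahl_speedup(DXDT_FRACTION, p), 2))
    # Limit for an unbounded number of processors.
    print("limit", round(1.0 / (1.0 - DXDT_FRACTION), 2))
```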


3.2  The  MPI  implementation 

The parallel program was implemented in C for a distributed memory parallel
computing environment using MPI (MPICH 1.1.2).

The flowchart below shows that all of the processes call the TimeStep routine. In
this routine, the master sends a sequence of stages to each of the other processes,
keeping the first group for itself. Each process also receives the last stage of the
previous process, since this information is needed for the calculations. All the
processes, including the master, contribute to the calculation by calling the dXdt
routine. At the end of each time step, the master receives the results sent by the
other processes and performs the control calculations, such as computing the overall
hold-up and checking for possible column flooding.
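The distribution of stages can be pictured with a small sketch. The paper does not say how the stages are split when the number of processes does not divide the number of stages, so the rule used here (the first few ranks get one extra stage) is an assumption, and `partition_stages` is a hypothetical helper written in Python for illustration only.

```python
def partition_stages(n_stages, n_procs):
    """Split stages 0..n_stages-1 into contiguous blocks, one block per process."""
    base, extra = divmod(n_stages, n_procs)
    blocks = []
    start = 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)  # assumed balancing rule
        stages = list(range(start, start + size))
        # Each process other than the master (rank 0) also needs the last
        # stage of the previous process as read-only input for dXdt.
        ghost = start - 1 if rank > 0 and size > 0 else None
        blocks.append({"rank": rank, "stages": stages, "ghost": ghost})
        start += size
    return blocks
```

For the 36-stage column and five processes, this gives one block of eight stages and four blocks of seven.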

Fig. 3. The parallel algorithm (flowchart columns: master, process p0; processes p1, ..., pn-1)

Fig. 4. The TimeStep routine (flowchart columns: master, process p0; processes p1, ..., pn-1)


In order to minimize the overhead due to information exchange, presently about
4 KB for each stage (13.3 KB when mass transfer is included), all the information was
sent in a single message (MPI_Isend), taking advantage of the count and derived
datatype parameters of MPI.

The program was first tested both on a heterogeneous cluster and on a
homogeneous one. On the heterogeneous cluster, from the Engineering Faculty of the
University of Porto, five Alpha processors were used, with different clock rates: 150
MHz (2 nodes), 175 MHz (2 nodes) and 266 MHz (1 node). A 100 Mbps FDDI
crossbar switch (Digital Equipment Corporation/Compaq GIGAswitch) connects
these nodes. The operating system is Tru64 Unix v4.0E. On the homogeneous
cluster, from the Dolphin [12] project of the Science Faculty of the University of
Porto, four dual-processor 300 MHz Pentium II nodes, interconnected by a Myrinet
network, were used. The operating system was Red Hat Linux 6.0.

Besides validating the results, the possibility of using these different
computation environments enabled us to identify problems in preserving efficiency
on heterogeneous clusters [13]. A comparison of Fig. 5 and Fig. 6, which show the
monitoring output of the public-domain Jumpshot utility, already reveals these
problems. These figures show the inter-process communications for the
heterogeneous cluster, with five processors, and for the homogeneous cluster, with six
processors, both for a drop volume discretization of 20 classes. The black blocks
represent the time consumed by the dXdt routine, and the gray blocks refer to the
TimeStep routine. The white arrows represent the stage exchanges between the
processes.


Fig. 5. Jumpshot result for the heterogeneous cluster (time lines for processes 0 to 4)

Fig. 6. Jumpshot result for the homogeneous cluster (legend: TimeStep, dXdt, Holdup, Recebe andares [receive stages])


On the homogeneous cluster, with a drop volume discretization of 100 classes, the
results obtained with six processors showed speedups exceeding a factor of four
(Fig. 7). This result, for a realistic problem dimension, indicates that parallelization
pays off for the intended application [13].

Fig. 7. Speedup for 100 classes, with the cluster of project Dolphin


4.  Results  and  discussion 

With a view to the future application of such a parallel program in industry, a
homogeneous dedicated cluster was selected. It is important to stress, at this point,
that MPI does not respond dynamically to the potential inefficiencies caused by
non-uniform computing speeds of the cluster nodes and the variability of shared
resources [14].

The  program  was  executed  on  the  Beowulf  Cluster  of  the  Engineering  Faculty  of 
the  University  of  Porto.  The  present  configuration  of  this  cluster  of  commodity  PCs  is 
one  front-end  node  and  twenty-two  computing  nodes.  The  front-end  is  a  dual 
Pentium  III  550  MHz  processor,  with  512  MB  of  memory  and  18  GB  of  disk.  Each 
computing  node  is  a  single  450  MHz  Pentium  III,  with  128  MB  of  memory  and  6  GB 
of  disk.  The  nodes  are  connected  using  a  Fast  Ethernet  BayNetworks  450-24  port 
switch.  The  operating  system  is  Linux  Slackware  7.0  [15].  The  results  obtained  are 
presented  in  Fig.  8  and  Fig.  9. 

These results show speedups exceeding a factor of six, with eighteen processors,
for a drop volume discretization of 100 classes. It can be observed that the speedup,
although increasing, shows some plateaus. For instance, between nine and eleven
processors the speedup stabilizes, and it rises again for twelve processors. Notice that
nine and twelve divide thirty-six, which is the number of stages of the column. From
twelve to seventeen processors there is again a plateau, and another at a higher level
from eighteen to twenty-two processors. Eighteen also divides thirty-six. These
performance leaps are related to the way in which the work is distributed among the
processors. First, when the number of processors divides the number of
stages, the workload is equally distributed. Second, as the number of processors
grows, the calculation time per processor decreases while the communication time
increases, so the granularity decreases.

The speedup results for two discretizations of the drop phase-space, 50 and
100 drop volume classes, are shown in Fig. 9. For twenty-two processors and 300
time-steps, the speedup increases from 3.71 to 5.97, being higher for the
finer discretization. With 100 drop volume classes and four processors, the
simulation is already faster than the real process.
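The plateaus reported above are what one would expect if whole stages are handed out in blocks, because the time per step is then set by the process holding the most stages. The following Python sketch is an idealized model added for illustration, not the authors' analysis: it ignores communication entirely and assumes every stage costs the same. It places the jumps at 9, 12 and 18 processes for a 36-stage column.

```python
import math


def ideal_speedup(n_stages, n_procs):
    """Communication-free speedup bound when whole stages are distributed."""
    max_stages_per_proc = math.ceil(n_stages / n_procs)
    return n_stages / max_stages_per_proc


if __name__ == "__main__":
    for p in range(8, 23):
        print(p, ideal_speedup(36, p))
```

The measured speedups are of course lower than this bound, since communication grows with the number of processes.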


5.  Conclusions  and  future  work 

The application that motivated this work was the simulation of the dynamic
behaviour of liquid-liquid agitated columns. The execution times of the
sequential algorithms previously published by some of the authors need to be
reduced before they can be considered for real-time plant control.

Clustered systems, using commodity processors and standard Ethernet networks,
are increasingly popular owing to their low price/performance ratio.

We  have  shown  that  PC  clusters  are  well  suited  for  the  intended  application.  The 
results  presented  in  section  4  lead  to  the  conclusion  that  parallelization  pays  off  for 
the  numerical  technique  used,  based  upon  a  space-time  discretization  and  a  stepping 
procedure,  with  explicit  calculation  of  time  derivatives.  The  fact  that  the  speedup 
increases  with  the  problem  size  is  an  important  result  for  the  future  work,  because 
mass  transfer  simulations  involve  much  heavier  calculations  than  the  hydrodynamics. 

Extensions  of  the  algorithm  to  include  mass  transfer  are  presently  under 
development,  as  well  as  studies  concerning  the  optimization  of  the  drop  interaction 
constants  and  transport  parameters. 

In this version of the parallel program, the master is responsible for all global
calculations, besides its own stage calculations, as a separate process. With this
approach, all communications take place between the master and the other
processes. Work is in progress to test another methodology, in which all processes take
part in the global calculations, implying communication between process i and its
neighbouring process instead of all communication being with the master. This
solution takes work away from the master but increases communication between the
processes. The analysis of the results will show whether, with this other
communication and work distribution scheme, the speedup can be further improved.


Abstract. Heterogeneous networks of workstations and/or personal
computers (NOW) are increasingly used as a powerful platform for the
execution of parallel applications.

Sometimes applications are developed with this type of heterogeneous
environment in mind, but in most cases applications already developed
for traditional parallel machines (homogeneous and dedicated) are
ported to NOWs, resulting in performance degradation due in part to
less efficient communications but more often to unbalancing.

In this work we propose a simple model able to analyze and predict
the performance on heterogeneous NOWs of regular data-parallel
applications originally developed for ring or 2-D mesh topologies. To improve
performance, the computation time on the various nodes must be as
balanced as possible. This can be obtained in two ways: by heterogeneous
data partitioning or by assigning to each node a number of processes
proportional to its relative power.

A test case based on matrix multiplication is analyzed and the results
predicted by the model are compared with the ones collected
experimentally.

Our  analysis  shows  that  an  efficient  porting  of  homogeneous  data-parallel 
applications  on  heterogeneous  NOWs  is  possible  and  can  be  achieved  in 
most  cases  in  a  quite  straightforward  and  effective  way. 


1  Introduction 

In recent years, networks of workstations and/or personal computers have been
increasingly used for the execution of parallel applications [7, 11]. Indeed, technological
advances have made available nodes with high computing power and interconnection
networks with sufficiently high communication speed.

These  systems  constitute  a  viable  alternative  to  classical  parallel  machines 
(which  are  homogeneous  and  dedicated)  and  have  the  advantages  of  a  wide 
availability  and  a  good  price/performance  ratio. 

The main features of NOWs are: heterogeneity, since in most cases the various
nodes are different, making a good balance among nodes a critical aspect; and
communication latency, which is normally higher than in 'true' parallel
machines, imposing limits on fine-grain computation.

A  simple  and  effective  way  to  achieve  good  efficiency  on  such  platforms  is  the 
use  of  the  master-worker  programming  model  with  the  pool  of  tasks  paradigm, 
which  is  self-balancing  [10].  However,  this  approach  is  only  feasible  if  tasks  are 
independent.  Moreover,  it  cannot  be  adopted  if  we  are  interested  in  the  efficient 
and  straightforward  porting  on  NOWs  of  parallel  applications  which  have  been 
developed  with  different  programming  models  for  homogeneous  and  dedicated 
parallel  systems. 

Particularly,  a  number  of  data-parallel  applications  have  been  implemented 
on  homogeneous  systems  with  regular  topologies  such  as  ring  and  mesh  using 
the  SPMD  model,  obtaining  loosely  synchronous  applications,  well  balanced  and 
therefore  providing  a  good  efficiency. 

If we execute applications belonging to this class on heterogeneous NOWs,
the various nodes have in general different speeds, so the fastest ones exhibit
a high idle time, resulting in an overall performance degradation. In order to
minimize idle time, the computational work in each node must be as nearly
proportional as possible to the computing power of the node.

Similar problems have recently been addressed by other authors. In [1] the
problem arising with the use of grid algorithms on heterogeneous workstation
networks is addressed, and solutions based on sophisticated data allocation
methods are proposed.

In this work we consider two possible strategies to obtain a good load
balance: a single process per node with heterogeneous data partitioning; or
homogeneous data partitioning, assigning a different number of processes to each node
according to its computing power.

We propose a simple model able to evaluate performance in the various cases,
taking into account the parameters involved at the application level (e.g.
computational work and communication amount), at the architectural level (e.g.
interconnection network speed) and at both levels (e.g. relative speed of nodes).

A test case based on matrix multiplication is analyzed and the results
obtained with the model are compared with the ones collected experimentally.

2  Regular  SPMD  applications 

Many applications are suitable for parallelization on regular topologies (e.g.
ring or 2-D mesh) with an even distribution of data among processors.

The  code  in  each  node  consists  normally  of  an  initialization  phase,  a  loop 
and  a  termination  phase  (Fig.  1).  In  each  loop  iteration  there  are  a  computation 
phase  and  a  communication  phase  with  neighbouring  nodes,  i.e.  nodes  connected 
by  direct  links  on  the  considered  topology. 

For the generic $l$-th loop iteration ($l = 1, \ldots, L$), the elapsed time $T_i$ on the
$i$-th node can be expressed as

$$T_i = T_i^{comp} + T_i^{comm} + T_i^{idle} \qquad (1)$$

    initialization phase
    loop
        compute
        send data to neighbouring nodes
        receive data from neighbouring nodes
    end loop
    termination phase


Fig.  1.  Process  structure  on  each  node 


Usually, the send is asynchronous and the receive is blocking, resulting in a
loose synchronization among processes. Since we have a regular partitioning
on a homogeneous parallel system, the application is self-balancing ($T_i^{idle} \approx 0$).

Sometimes, depending on the particular application, we can achieve a more
efficient implementation by slightly modifying the loop structure, for example by
moving the data sending before the computation.

Communications  can  be  carried  out  using  proprietary  primitives,  optimized 
for  the  different  architectures,  but  more  often  standard  libraries  such  as  PVM 
or  MPI  are  used,  ensuring  code  portability  among  different  platforms. 

This computational scheme occurs in various applications [6]. Among
others, we mention matrix multiplication, long-range interactions [5], and finite
difference methods for the solution of Laplace equations. Other types of
applications, such as finite element methods, particle dynamics and some kinds of image
processing [9], have a similar scheme but may require in addition the use of global
communications and/or collective operations.


3  The  heterogeneous  computing  environment 

Let us consider a heterogeneous network of workstations or personal computers
(generically denoted by NOW) consisting of $p$ machines, in general with different
features, connected by a switched communication network (e.g. Ethernet, Fast
Ethernet or ATM), with all links providing the same communication speed.

For a given application A, let us assume that the heterogeneity of nodes can
be expressed by a single parameter, namely the relative speed $s_i$ of node $i$ with
respect to a fixed reference machine, not necessarily belonging to the network
[3]. $s_i$ depends mainly on the clock speed ratio of nodes, but also on the kind of
application and on cache and memory size and organization.

It is beyond the scope of this paper to provide a precise characterization of
the node speed [14]. We suppose that speeds can be measured by executing the
application under investigation, or benchmarks belonging to the same class, on the
various nodes. We assume that the speed of each node does not vary, at least
as a first approximation, whether we measure it using the whole application or any
portion of it [8].

The total relative speed of the $p$-node NOW is

$$S = \sum_i s_i \qquad (2)$$

and the average relative speed is

$$\bar{s} = \frac{S}{p} \qquad (3)$$

Since in the present work we are mainly interested in discussing the impact
of heterogeneity, we suppose that the NOW is dedicated. Otherwise we can use
an equivalent relative speed given by

$$s_i\,(1 - \sigma_i) \qquad (4)$$

$\sigma_i$ being the load factor of node $i$.

Let us assume that the transmission time along the network can be expressed as

$$t_{trans} = \alpha + \beta \cdot M \qquad (5)$$

where $\alpha$ is the latency (average value over the NOW), $\beta$ is the communication
time per byte and $M$ is the message length in bytes.

We refer to a message-passing library such as PVM or MPI [12]. In this case
the communication time for a message is the sum of three contributions [10]

$$t_{comm} = t_{pk} + t_{trans} + t_{upk} \qquad (6)$$

where $t_{pk}$ is the time to prepare the message on the sending node, and $t_{upk}$ is
the time to unpack the message on the receiving node.
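The communication model can be evaluated directly. The Python sketch below simply codes the latency-plus-bandwidth transmission time and the three-term communication time; the numerical values in the usage part are hypothetical (they are not measurements from the paper) and were chosen only to resemble a 100 Mbit/s link.

```python
def transmission_time(alpha, beta, n_bytes):
    """Latency plus per-byte cost: t_trans = alpha + beta * M."""
    return alpha + beta * n_bytes


def communication_time(alpha, beta, n_bytes, t_pk=0.0, t_upk=0.0):
    """Packing, transmission and unpacking: t_comm = t_pk + t_trans + t_upk."""
    return t_pk + transmission_time(alpha, beta, n_bytes) + t_upk


if __name__ == "__main__":
    ALPHA = 100e-6   # assumed latency in seconds
    BETA = 0.08e-6   # assumed time per byte in seconds (about 100 Mbit/s)
    for size in (1_000, 100_000, 1_000_000):
        print(size, communication_time(ALPHA, BETA, size))
```

With such values the latency term dominates for small messages, which is the reason fine-grain computation is penalized on a NOW.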

The time for packing/unpacking is greater if the encoding of data in a
machine-independent format is required; if all machines involved in the communication
support the same data format, no encoding is needed, and $t_{pk}$ and $t_{upk}$ are greatly
reduced.

To be exact, $t_{pk}$ and $t_{upk}$ depend on the speed of the node. However, since
these terms are normally much smaller than $t_{trans}$ and $t_{comp}$, we can, at least
as a first approximation, neglect them or otherwise consider their average value
over the nodes and add it to $t_{trans}$. In both cases we assume that the communication
speed is constant for all nodes.


4  Performance  analysis  of  SPMD  applications  on  NOWs 

Let $W$ be the total computational work involved in the application A under
consideration, and let $\tau$ be an atomic computing time (e.g. the time per element
or per operation) on the reference node.

The application A is decomposed by data parallelization into $p$ processes,
each requiring a computational work $W_i$, and the $i$-th process is executed on
the $i$-th node.

The computation time for one out of $L$ loop iterations on a generic $i$-th node
is therefore

$$T_i^{comp} = \frac{W_i\,\tau}{L\,s_i} \qquad (7)$$

and $T_i^{comm} = T^{comm}$ is the corresponding communication time, which under the
assumptions of the previous section is the same for all nodes.

Dealing  with  heterogeneous  computing  systems,  we  are  mainly  interested  in 
evaluating  the  idle  time  on  each  node,  since  this  is  the  main  factor  that  can  lower 
the  overall  performance. 

Let us consider an application with processes connected on a logical ring. Let
us define

$$\Delta T_i^{comp} = T_j^{comp} - T_i^{comp} = \left(\frac{W_j}{s_j} - \frac{W_i}{s_i}\right)\frac{\tau}{L}, \quad i = 1, \ldots, p \qquad (8)$$

where $j$ denotes the node with the highest computation time.

From our analysis it turns out that we can have two different behaviours,
depending on the computation and communication times and on the degree of
heterogeneity of the network. More precisely, there exists a threshold value for $T^{comm}$

$$T_{th}^{comm} = \frac{1}{p} \sum_i \Delta T_i^{comp} \qquad (9)$$

which allows us to distinguish the two following situations.

a) If $T^{comm} < T_{th}^{comm}$, after a transient phase of $p$ iterations a steady state
is reached, characterized by the fact that the idle time of each node does not
vary from one iteration to another, and it is given by

$$T_i^{idle} = \Delta T_i^{comp}, \quad i = 1, \ldots, p \qquad (10)$$

The duration of the transient phase does not depend on the mapping of processes
to nodes.

b) If $T^{comm} > T_{th}^{comm}$, the situation is slightly more involved, since after the
transient phase we get a periodic behaviour (with period $p$) in which the average
idle time over the set of nodes for each loop iteration is equal to $T^{comm}$.

Similar considerations apply to mesh-based applications, but the duration
of the transient phase can depend on the mapping.

In the following we will deal with case a), since in practice performance
is limited by unbalancing. Of course, by improving load balancing we move from
case a) to case b); however, in situation b), performance cannot be improved
further unless we modify the algorithm.

To obtain a first approximation of the impact of heterogeneity on efficiency,
we can neglect the communication overhead ($T^{comm} = 0$). Using eqs. (8) and (10),
the node-level efficiency is

$$\eta_i = \frac{T_i^{comp}}{T_i^{comp} + T_i^{idle}} = \frac{W_i/s_i}{W_j/s_j}, \quad i = 1, \ldots, p \qquad (11)$$

and the global efficiency is

$$\eta = \frac{\sum_i s_i\,\eta_i}{\sum_i s_i} \qquad (12)$$

Of course, the efficiency above is an upper bound of the actual efficiency,
since it neglects the communication overhead. $\eta_i$ and $\eta$ are defined at the step
level, but they coincide with the efficiencies of the whole computation, since all
the loop iterations are equal.
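The node-level and global efficiencies can be checked numerically for any distribution of work. The Python sketch below is an illustration under the same simplification as the text (communication neglected); the function names are hypothetical, and the global efficiency is taken as the speed-weighted average of the node-level efficiencies, which is how eq. (12) is read here.

```python
def computation_times(work, speeds, tau=1.0, n_iter=1):
    """Per-iteration computation time of each node: W_i * tau / (L * s_i)."""
    return [w * tau / (n_iter * s) for w, s in zip(work, speeds)]


def efficiencies(work, speeds):
    """Return (node-level efficiencies, global efficiency), communication neglected."""
    t_comp = computation_times(work, speeds)
    t_max = max(t_comp)                       # node j with the highest time
    t_idle = [t_max - t for t in t_comp]      # steady-state idle time
    node_eff = [t / (t + d) for t, d in zip(t_comp, t_idle)]
    global_eff = sum(s * e for s, e in zip(speeds, node_eff)) / sum(speeds)
    return node_eff, global_eff
```

For two nodes with relative speeds 1 and 2, equal work gives a global efficiency of 2/3, while work split in the ratio 1:2 gives 1.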

4.1  Evaluating  unbalancing  for  naive  porting 

In the case of a straightforward porting, by homogeneous data partitioning, of a
regular application to a heterogeneous NOW we have

$$W_i = \frac{W}{p}, \quad i = 1, \ldots, p \qquad (13)$$

Therefore eq. (8) becomes

$$\Delta T_i^{comp} = \left(\frac{1}{s_j} - \frac{1}{s_i}\right)\frac{W\,\tau}{p\,L}, \quad i = 1, \ldots, p \qquad (14)$$

and eq. (9) particularizes to

$$T_{th}^{comm} = \left(\frac{1}{s_j} - \frac{1}{s_H}\right)\frac{W\,\tau}{p\,L} \qquad (15)$$

where $s_H$ denotes the harmonic mean over $s_i$.

As expected, there is no idle time on the slowest node (in this case the
node with the highest computation time is the slowest one), and the idle time
increases with the node speed. Since the fastest nodes are poorly exploited, the
global efficiency is low.

Using eqs. (7,13) we obtain

$$\eta_i = \frac{s_j}{s_i}, \quad i = 1, \ldots, p \qquad (16)$$

and

$$\eta = \frac{s_j}{\bar{s}} \qquad (17)$$

where $\bar{s}$ is the average relative speed over the nodes composing the NOW.
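For the naive porting case the quantities above depend only on the node speeds. The following Python sketch evaluates them for a given list of relative speeds. It is an illustration rather than the authors' code: `naive_porting` is a hypothetical helper, and the threshold is computed as the mean of the computation-time differences over the nodes, which is the reading that makes the harmonic mean appear and should be treated as an assumption.

```python
def naive_porting(speeds, total_work=1.0, tau=1.0, n_iter=1):
    """Efficiencies and communication threshold for equal work on every node."""
    p = len(speeds)
    s_slow = min(speeds)                           # node j: the slowest node
    s_mean = sum(speeds) / p                       # average relative speed
    s_harm = p / sum(1.0 / s for s in speeds)      # harmonic mean of the speeds
    unit = total_work * tau / (p * n_iter)         # W * tau / (p * L)
    node_eff = [s_slow / s for s in speeds]
    global_eff = s_slow / s_mean
    t_threshold = (1.0 / s_slow - 1.0 / s_harm) * unit
    return node_eff, global_eff, t_threshold
```

Since the global efficiency equals the lowest speed divided by the average speed, a single slow node is enough to cap the efficiency of the whole NOW.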

4.2 Strategies to improve efficiency

Eqs. (8) and (10) show that the computational work carried out by each node should
be as nearly proportional as possible to its relative speed in order to keep unbalancing
low, i.e.

$$W_i \approx \frac{s_i}{S}\,W, \quad i = 1, \ldots, p \qquad (18)$$

This  can  be  achieved  in  two  ways. 

1) By using a heterogeneous partitioning of data among processors. With
this approach some changes in the code are required, thus making the porting
more costly. It may be useful to employ semi-automatic tools such as that
proposed in [2].

2) By splitting the application homogeneously into a number of processes $q$
greater than the number of nodes $p$, and assigning to each node a number of
processes $q_i$ as nearly proportional as possible to its relative speed, i.e.

$$q_i \approx \frac{s_i}{S}\,q, \quad i = 1, \ldots, p \qquad (19)$$

Of course, to maximize performance it is convenient to put logically
neighbouring processes on the same physical node [4].

In  the  remaining  part  of  this  section  we  address  in  more  detail  the  second 
approach.  In  this  case,  the  computational  work  for  each  node  is 

$$W_i = \frac{q_i}{q}\,W, \qquad i = 1, \ldots, p \qquad (20)$$

The node-level and global efficiencies become

$$\eta_i = \frac{q_i}{q_j}\,\frac{s_j}{s_i}, \qquad i = 1, \ldots, p \qquad (21)$$

and

$$\eta = \frac{q}{q_j}\,\frac{s_j}{S} \qquad (22)$$


We see that local efficiencies increase if the ratios $q_i/q_j$ are close to the corresponding
ratios $s_i/s_j$. Moreover, for a fixed q, $\eta$ is maximum if $q_j/s_j = \max_i(q_i/s_i)$
is minimum.

The  number  of  processes  required  to  achieve  a  good  balancing  increases  with 
the  degree  of  heterogeneity  of  the  network,  and  correspondingly  the  process 
granularity  must  decrease. 

To show an example of application, let us consider a NOW of six nodes, with
constant total power S = 6, and therefore $\bar{s}$ = 1. We choose four different
configurations with increasing heterogeneity (expressed by $h = 1 - s_{min}/\bar{s}$, where
$s_{min}$ denotes the lowest relative speed in the NOW), reported in Table 1, and we
vary the total number of processes from 6 up to 75. For each configuration and
each value of q we find the optimal $q_i$ using the criterion of minimizing $q_j/s_j$.
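The search for the $q_i$ described above can be sketched as follows. This is only an illustrative sketch, not the authors' procedure: the function names are hypothetical, the greedy rule (give each additional process to the node that would finish earliest with it) is one simple way of minimizing $\max_i(q_i/s_i)$, and it assumes that every node runs at least one process. The efficiency is computed as the ideal (perfectly balanced) time divided by the time of the most loaded node, which is an assumption consistent with eqs. (20) and (21).

```python
def allocate_processes(speeds, q):
    """Greedy integer allocation aiming at the smallest max_i(q_i / s_i).

    Assumption (not stated in the text): every node gets at least one process.
    """
    p = len(speeds)
    if q < p:
        raise ValueError("need at least one process per node")
    alloc = [1] * p
    for _ in range(q - p):
        # Give the next process to the node that would finish earliest with it.
        i = min(range(p), key=lambda k: (alloc[k] + 1) / speeds[k])
        alloc[i] += 1
    return alloc


def global_efficiency(speeds, alloc):
    """Ideal balanced time divided by the time of the most loaded node."""
    q = sum(alloc)
    total_power = sum(speeds)
    worst = max(qi / si for qi, si in zip(alloc, speeds))
    return (q / total_power) / worst


if __name__ == "__main__":
    # Most heterogeneous configuration of Table 1 (h = 0.90).
    speeds = [0.1, 0.4, 0.7, 1.3, 1.6, 1.9]
    for q in (6, 12, 25, 50, 75):
        alloc = allocate_processes(speeds, q)
        print(q, alloc, round(global_efficiency(speeds, alloc), 3))
```

With q = p = 6 every node runs one process and the efficiency reduces to $s_{min}/\bar{s} = 1 - h$, which is the naive-porting case.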
Table 1. Configurations with increasing heterogeneity used in Fig. 2; p = 6; S = 6

Relative speeds of the six nodes          h
0.6   0.8   1.0   1.0   1.2   1.4        0.40
0.4   0.6   0.8   1.2   1.4   1.6        0.60
0.2   0.2   0.6   0.6   2.2   2.2        0.80
0.1   0.4   0.7   1.3   1.6   1.9        0.90


Fig. 2 shows the global efficiency, computed from eq. (22), versus the degree
of heterogeneity of the system. We see that, as expected, for a fixed q efficiency
decreases as h increases, and the effect is stronger when q is smaller. However, a
reasonable number of processes (e.g. q ≈ 50) is sufficient to achieve an efficiency
of about 0.8, even with a highly heterogeneous network.
Fig. 2. Efficiency vs. the total number of processes q, for various values of h; p = 6.

Of course, a trade-off exists between balancing and the need to keep other
overheads low, in particular the time lost due to context switching, which is
not considered in the present analysis.

The  approach  based  on  the  use  of  multiple  processes  per  node  permits  a 
complete  reuse  of  code  developed  for  homogeneous  platforms. 
5 Simulation and experimental results

We set up a simple model able to simulate the execution of regular applications
on NOWs, with the three different approaches outlined in the previous section.
The model uses some parameters at the hardware level (i.e. the number of processors
p and the network speed, expressed by α and β), and some parameters
which also depend on the selected application (i.e. the atomic time τ on the reference
node and the relative node speeds $s_i$). The third approach also requires
the total number of processes q. From such low-level parameters, the model computes
for the given application the computation, communication and idle times
at the loop iteration level for each processor. In this way the model yields the
figures of speed-up and efficiency of the whole application.

The model is tested using the matrix multiplication algorithm that computes
C = A × B, with A, B and C n × n matrices, on a logical ring of processes, as
described in [13].

In the original SPMD implementation with homogeneous data partitioning
each processor i stores a slice of matrix A and a slice of matrix B, each comprising
rows from (i-1)·n/p to i·n/p. Slices of A remain local to the various processors,
whereas slices of B circulate along the ring. The whole computation requires p
loop iterations and at the end processor i has computed n/p rows of C, from
row (i-1)·n/p to row i·n/p.

So, the computation time of node i during the l-th iteration is

$$T^{comp}_{i,l} = 2\,\frac{n^3}{p^2}\,\frac{\tau}{s_i}, \qquad i = 1, \ldots, p \qquad (23)$$

and in each iteration $n^2/p$ elements of B are moved between neighbouring nodes.

Using the heterogeneous data partitioning approach in this case means
assigning to each node a slice of matrix A with a number of rows proportional to its
relative speed, whereas matrix B is still evenly partitioned among nodes.
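As an illustration of the heterogeneous data partitioning just described, the sketch below splits the n rows of A among the nodes in proportion to their relative speeds. It is not taken from the paper: the function name is hypothetical, and the largest-remainder rounding used to obtain integer row counts is an assumed, simple choice.

```python
def partition_rows(n, speeds):
    """Split n rows among nodes proportionally to their relative speeds.

    Uses largest-remainder rounding so that the row counts sum exactly to n.
    """
    total = sum(speeds)
    ideal = [n * s / total for s in speeds]
    rows = [int(x) for x in ideal]          # round every share down first
    leftover = n - sum(rows)
    # Hand the remaining rows to the nodes with the largest fractional parts.
    order = sorted(range(len(speeds)),
                   key=lambda i: ideal[i] - rows[i], reverse=True)
    for i in order[:leftover]:
        rows[i] += 1
    return rows


if __name__ == "__main__":
    # Two nodes with relative speeds 1.00 and 1.87, n = 1000 rows.
    print(partition_rows(1000, [1.00, 1.87]))
```

With such a partition each node's share of work is roughly proportional to its speed, so all nodes finish an iteration at about the same time.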

The third approach is exactly the same as the first, with the exception that
q processes (with q > p) are generated and the optimal $q_i$ are given by eq. (19).

The various versions of this test program are implemented in C with
PVM v. 3.4 and executed on a variable number of nodes belonging to a
NOW of six workstations connected by a switched Ethernet. Table 2 shows
the characteristics of the various nodes and the total power of the different
configurations.

The trials are executed on dedicated nodes and with low traffic on the
network. The measured value of the time per element on the reference node is
τ = 0.56 μs. We measure on the network the values α = 1 ms and β = 1 μs.

Experimental data has been collected using 1000 × 1000 floating point matrices.
Table 3 reports the measured and simulated speed-up for the three different
approaches. As expected, the speed-up of the straightforward homogeneous
partitioning is well below the ideal one, while the two proposed strategies to reduce
unbalancing yield considerably better speed-up figures.
Table 2. The first column identifies the configuration, which includes nodes up to the
current row; for each configuration the type and the relative speed of the nodes, the
total computing power and the degree of heterogeneity h are reported

Config. Id.   Workstation      Relative speed   Available computing power   h
*             Sparc-20         1.00             1.00                        -
C1            SGI-O2           1.87             2.87                        0.31
C2            SGI-O2           1.90             4.77                        0.37
C3            Sparc-Ultra 5    1.87             6.64                        0.40
C4            Sparc-Ultra 5    1.85             8.49                        0.41
C5            Indigo 2         5.87             14.36                       0.58

Table 3. The first column gives the configuration identifier; the SPMD columns
provide the speed-up for the homogeneous SPMD application, measured (M) and simulated (S);
the HD columns summarize the speed-up for heterogeneous data partitioning; the VP columns
provide the speed-up for homogeneous data partitioning but with a number of processes on
each node proportional to its relative speed (the total number of processes q is reported
in the last column)

Config. Id.   SPMD-M   SPMD-S   HD-M    HD-S    VP-M    VP-S    q
C1            2.06     1.99     2.89    2.86    2.71    2.80    3
C2            3.09     3.00     4.69    4.76    4.68    4.66    5
C3            4.62     4.00     6.62    6.62    6.14    6.51    7
C4            5.80     5.00     8.36    8.48    8.18    8.35    9
C5            6.54     5.95     13.23   14.24   12.67   13.6    15

Measured and simulated data are in most cases in good agreement, thus
confirming that the proposed model is quite reliable. The maximum errors occur
in the case of the SPMD homogeneous implementation, and they are due to an
underestimation of the relative speed of the slowest nodes. In fact we assume that
the relative speed of each node does not vary with the data size handled by
the node. In practice, we can sometimes observe a gain in processor speed when the
amount of local data decreases, for example due to better use of the memory
hierarchy. This is more relevant in the homogeneous data partitioning case,
where the relative weight of the slowest nodes is greater.
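A quick plausibility check of the simulated figures can be made by ignoring communication altogether. The sketch below is an assumption-laden illustration, not the paper's model: with homogeneous partitioning every iteration waits for the slowest node, so the speed-up relative to the reference node is at most $p \cdot s_{min}$, while a perfectly balanced distribution is limited by the total computing power. The function name is hypothetical; the speeds are those of Table 2.

```python
# Relative speeds from Table 2, in row order (reference node first).
RELATIVE_SPEEDS = [1.00, 1.87, 1.90, 1.87, 1.85, 5.87]


def computation_only_bounds(speeds):
    """Return (homogeneous bound, balanced bound), communication ignored."""
    homogeneous = len(speeds) * min(speeds)   # all nodes wait for the slowest
    balanced = sum(speeds)                    # total available computing power
    return homogeneous, balanced


if __name__ == "__main__":
    # Configurations C1..C5 use the first 2..6 nodes of Table 2.
    for k in range(2, len(RELATIVE_SPEEDS) + 1):
        spmd, balanced = computation_only_bounds(RELATIVE_SPEEDS[:k])
        print(f"C{k - 1}: SPMD <= {spmd:.2f}, balanced <= {balanced:.2f}")
```

These bounds sit slightly above the simulated SPMD-S and HD-S columns of Table 3, as expected once communication is added, while some measured SPMD values exceed the bound, which is the effect discussed in the paragraph above.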
6 Conclusions

We analyzed the problem of porting data-parallel applications originally developed
for homogeneous parallel systems with regular topologies (e.g. ring or mesh)
to networks of workstations and/or personal computers.

For this kind of computing resource, maintaining the even partitioning of
data among processors yields poor performance, since efficiency is limited by
unbalancing, which increases with the degree of heterogeneity of the network.

Two strategies are considered to overcome this problem: heterogeneous data
partitioning, or allocation to each node of a number of processes proportional
to its relative power.

A simple model is proposed to analyze and predict the performance of the
considered class of applications using the various approaches.

The model is tested using a matrix multiplication algorithm with processes
arranged in a ring topology. Good agreement is obtained between simulated
and experimental performance figures, both for the naive unbalanced
implementation and for the two improved implementations.
Abstract. This work presents a parallelization of a recursive decoupling
method for solving tridiagonal linear systems on distributed memory
computers. We study the fill-in in the algorithm to optimize the execution of
the scalar algorithm and to perform the communications. Finally, we
evaluate the algorithm through specific tests on the Fujitsu AP3000.


1  Introduction 

In recent years considerable effort has been devoted to solving tridiagonal systems
(TS), a very important class of linear systems which appear when the finite
difference method is used to solve partial differential equations such
as the simple harmonic motion, Helmholtz, Poisson, Laplace and diffusion equations.
The finite difference method involves the discretization of the differential
equation and subsequently the solution of the tridiagonal systems thus generated.

There are many algorithms for solving TS, such as Gaussian elimination or
LU elimination, that have proved to be the most effective sequential algorithms
on serial computers. However, these algorithms cannot be directly adapted to
parallel computers. Much research has been undertaken on parallel algorithms for
solving TS. Hockney proposed the cyclic (odd-even) reduction (CR) algorithm in
1965. Although originally proposed as sequential, this algorithm can be adapted
to run on a wide range of parallel architectures [8, 5]. In addition, new methods
for increasing the parallelism of the CR algorithm, such as PARACR [9] or the radix-p
CR algorithm [8], have been proposed. On the other hand, other well-known
strategies have been adapted to obtain new parallel TS algorithms, such as those
proposed by Egecioglu et al. [6] (recursive doubling strategy), Lin and Cheng
[12] (prefix), and Wang and Mou [17] and Spaletta and Evans [16], which exploit
the parallelism of the divide-and-conquer strategy. Finally, a group of hybrid
algorithms have been proposed that are based on partitioning the system into
blocks of equations, using a local algorithm to reduce the subsystem in each block
and a global algorithm to solve the reduced system. In this group we include the
algorithms by Krechel, Plum and Stuben [10], Cox and Knisley [4], Muller and
Scheerer [15], Matton, Williams and Hewett [14] and Amodio and Brugnano [2].
In [1] we have classified the above TS algorithms in terms of their data flows
and presented a unified parallelization on computers with mesh topology and
distributed memory.

In this paper, we consider the parallelization of the recursive decoupling
algorithm by Spaletta and Evans [16] on a distributed memory multiprocessor.
This algorithm behaves very well in terms of accuracy as the problem
size increases, and the partitioning process leads to independent systems. As
established in previous works, the memory allocation requirement is demanding [16]
and the execution times are not competitive with other partitioning methods [1].
In this paper we propose a technique to reduce the execution time of the scalar
algorithm, to minimize the memory requirements and to optimize the communications
in the parallel implementation. This technique is based on the sparsity of
the matrix obtained in the recursive fill-in process of this algorithm.

The  rest  of  the  work  is  organized  as  follows:  in  Section  2  we  present  the 
recursive  decoupling  algorithm  by  Spaletta-Evans.  The  parallel  algorithm  is  pre¬ 
sented  in  Section  3.  Experimental  results  on  the  Fujitsu  AP3000  multiprocessor 
are  shown  in  Section  4.  Finally,  in  Section  5  we  present  the  conclusions. 


2  The  Recursive  Decoupling  Algorithm 

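We consider a set of N linear equations with N unknowns

$$A\mathbf{u} = \mathbf{d}, \qquad (1)$$

where A is a tridiagonal N × N matrix of the form

$$A = \begin{pmatrix}
b_0 & c_0 & & & \\
a_1 & b_1 & c_1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{N-2} & b_{N-2} & c_{N-2} \\
 & & & a_{N-1} & b_{N-1}
\end{pmatrix}, \quad \text{with } |b_i| > |a_i| + |c_i|, \ \forall i = 0, 1, \ldots, N-1. \qquad (2)$$

With no loss of generality we will assume that the number of equations N is a
power of two, $N = 2^n$. We will denote $m = N/2 = 2^{n-1}$.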

The recursive decoupling algorithm is based on the recursive calculation of
the inverse of matrix A by means of the Sherman-Morrison formula [7]. To this
end, we decompose the matrix A (2) as follows:

$$A = \begin{pmatrix}
e_0 & c_0 & & & & & \\
a_1 & e_1 & & & & & \\
 & & e_2 & c_2 & & & \\
 & & a_3 & e_3 & & & \\
 & & & & \ddots & & \\
 & & & & & e_{N-2} & c_{N-2} \\
 & & & & & a_{N-1} & e_{N-1}
\end{pmatrix} + \sum_{j=1}^{m-1} \mathbf{x}^{(j)} \mathbf{y}^{(j)T} \qquad (3)$$

where

$$e_0 = b_0, \qquad e_{N-1} = b_{N-1} \qquad (4)$$

and

$$e_{2j-1} = b_{2j-1} - a_{2j}, \qquad e_{2j} = b_{2j} - c_{2j-1}, \qquad \text{when } j = 1, \ldots, m-1. \qquad (5)$$

In expression (3), the column vectors $\mathbf{x}^{(j)}$ and $\mathbf{y}^{(j)}$ have only
two non-zero elements, at the positions 2j - 1 and 2j, that is

$$\mathbf{x}^{(j)} = (0, \ldots, 0, 1, 1, 0, \ldots, 0)^T, \qquad
\mathbf{y}^{(j)} = (0, \ldots, 0, a_{2j}, c_{2j-1}, 0, \ldots, 0)^T. \qquad (6)$$

In matrix notation, the partitioning of A given in equation (3) is denoted as

$$A = J + \sum_{j=1}^{m-1} \mathbf{x}^{(j)} \mathbf{y}^{(j)T}, \qquad (7)$$

where J is the 2 × 2 block diagonal matrix in equation (3).

The basic idea underlying the choice of this particular partitioning is given
by the Sherman-Morrison method. Sherman and Morrison proved that, given two
N × N matrices A and J such that $A = J + \mathbf{x}\mathbf{y}^T$, the inverse of matrix A can
be obtained by the formula

$$A^{-1} = J^{-1} - \alpha\,(J^{-1}\mathbf{x})(\mathbf{y}^T J^{-1}), \qquad \alpha = \frac{1}{1 + \mathbf{y}^T J^{-1}\mathbf{x}}. \qquad (8)$$

To directly compute the inverse of matrix A would cost $O(N^3)$ arithmetic operations,
while the use of formula (8) only implies $O(N^2)$ operations. When applied
to solve a linear system of equations $A\mathbf{u} = \mathbf{d}$, the solution will be

$$\mathbf{u} = A^{-1}\mathbf{d} = (I - \alpha J^{-1}\mathbf{x}\mathbf{y}^T)\,J^{-1}\mathbf{d}. \qquad (9)$$

This process avoids the explicit computation of the inverse matrix.
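The following sketch illustrates formula (9) for a single rank-one term. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the caller supplies a routine `solve_J` that returns $J^{-1}\mathbf{r}$ for a given vector (here J is taken to be diagonal only to keep the example short).

```python
def dot(v, w):
    """Inner product of two equally long lists."""
    return sum(vi * wi for vi, wi in zip(v, w))


def sherman_morrison_solve(solve_J, x, y, d):
    """Solve (J + x y^T) u = d using only solves with J, as in formula (9)."""
    w = solve_J(d)                      # J^{-1} d
    g = solve_J(x)                      # J^{-1} x
    alpha = 1.0 / (1.0 + dot(y, g))     # scalar of formula (8)
    scale = alpha * dot(y, w)
    return [wi - scale * gi for wi, gi in zip(w, g)]


if __name__ == "__main__":
    diag = [2.0, 3.0, 4.0]                          # J = diag(2, 3, 4)
    solve_J = lambda r: [ri / di for ri, di in zip(r, diag)]
    x = [1.0, 1.0, 0.0]
    y = [0.0, 1.0, 1.0]
    d = [1.0, 2.0, 3.0]
    print(sherman_morrison_solve(solve_J, x, y, d))
```

Only two solves with J and a few inner products are needed; the inverse of A is never formed.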

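The Recursive Decoupling method, described in [16], derives the solution of
system (1) by considering that $A = J + \sum_{j=1}^{m-2} \mathbf{x}^{(j)}\mathbf{y}^{(j)T} + \mathbf{x}^{(m-1)}\mathbf{y}^{(m-1)T}$, then
applying the Sherman-Morrison formula (8) to matrices A and $J + \sum_{j=1}^{m-2} \mathbf{x}^{(j)}\mathbf{y}^{(j)T}$.
The recursive procedure is as follows:

$$M_h = \Bigl(J + \sum_{j=1}^{h} \mathbf{x}^{(j)}\mathbf{y}^{(j)T}\Bigr)^{-1}
= \bigl(I - \alpha_h M_{h-1}\mathbf{x}^{(h)}\mathbf{y}^{(h)T}\bigr) M_{h-1},
\qquad \alpha_h = \frac{1}{1 + \mathbf{y}^{(h)T} M_{h-1}\mathbf{x}^{(h)}}. \qquad (10)$$

Index h goes from 1 to m - 1, $M_0$ being the matrix $J^{-1}$, and the last matrix
$M_{m-1}$ will be $A^{-1}$. Let us denote $\mathbf{g}^{(h)} = M_{h-1}\mathbf{x}^{(h)}$. Observe that these vectors
are needed to obtain the recursive formula (10) and can be computed using a
similar recursive method

$$\mathbf{g}^{(h)} = \bigl(I - \alpha_{h-1}\mathbf{g}^{(h-1)}\mathbf{y}^{(h-1)T}\bigr) M_{h-2}\mathbf{x}^{(h)}
= \prod_{j=1}^{h-1} \bigl(I - \alpha_j \mathbf{g}^{(j)}\mathbf{y}^{(j)T}\bigr)\, J^{-1}\mathbf{x}^{(h)}. \qquad (11)$$

In order to obtain the final solution $\mathbf{u} = A^{-1}\mathbf{d}$, from (10) follows a recursive
formula similar to (11)

$$\mathbf{u} = A^{-1}\mathbf{d} = \prod_{j=1}^{m-1} \bigl(I - \alpha_j \mathbf{g}^{(j)}\mathbf{y}^{(j)T}\bigr)\, J^{-1}\mathbf{d}. \qquad (12)$$

In both products the factors are applied in order of increasing j, starting from the right.
Then we need to carry out the following steps.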


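Step 1 In this step the matrix $J^{-1}$ is calculated, as well as the product $J^{-1}\mathbf{d}$, the
initial value of u. Given the shape of matrix J, its inverse may be obtained
by calculating the inverse of each 2 × 2 block $J_j$,

$$J_j^{-1} = \frac{1}{\Delta_j}\begin{pmatrix} e_{2j-1} & -c_{2(j-1)} \\ -a_{2j-1} & e_{2(j-1)} \end{pmatrix},
\qquad \Delta_j = e_{2(j-1)}e_{2j-1} - a_{2j-1}c_{2(j-1)}, \qquad (13)$$

so the value of $J^{-1}\mathbf{d}$ becomes

$$J^{-1}\mathbf{d} = \begin{pmatrix}
(e_1 d_0 - c_0 d_1)/\Delta_1 \\
(-a_1 d_0 + e_0 d_1)/\Delta_1 \\
\vdots \\
(e_{2j-1} d_{2(j-1)} - c_{2(j-1)} d_{2j-1})/\Delta_j \\
(-a_{2j-1} d_{2(j-1)} + e_{2(j-1)} d_{2j-1})/\Delta_j \\
\vdots \\
(e_{2m-1} d_{2m-2} - c_{2m-2} d_{2m-1})/\Delta_m \\
(-a_{2m-1} d_{2m-2} + e_{2m-2} d_{2m-1})/\Delta_m
\end{pmatrix}. \qquad (14)$$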
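Step 2 Compute the initial vectors $\mathbf{g}^{(j)} = J^{-1}\mathbf{x}^{(j)}$ for indices j = 1, ..., m - 1.
Because of the pattern of the vector $\mathbf{x}^{(j)}$, the vector $\mathbf{g}^{(j)}$ has non-zero
elements only from position 2j - 1 to position 2j + 2,

$$\mathbf{g}^{(j)} = \Bigl(0, \ldots, 0,\ -\frac{c_{2(j-1)}}{\Delta_j},\ \frac{e_{2(j-1)}}{\Delta_j},\ \frac{e_{2j+1}}{\Delta_{j+1}},\ -\frac{a_{2j+1}}{\Delta_{j+1}},\ 0, \ldots, 0\Bigr)^T. \qquad (15)$$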


Step 3 In this last stage, vectors u and g^(j) are updated by using equations (11)
and (12). This rank-one updating procedure, which also makes use of the
particular shape of vectors x^(j) and y^(j), can be described as follows
(loops are written as: for index = first, last, step):

for k = 1, 2, ..., n - 1
    for j = 2^(k-1), 2^(n-1) - 2^(k-1), 2^k
        α_j = 1 / (1 + y^(j)T g^(j))
        u = (I - α_j g^(j) y^(j)T) u
        for i = 2^k, 2^(n-1) - 2^k, 2^k
            g^(i) = (I - α_j g^(j) y^(j)T) g^(i)
        end
    end
end
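Steps 1 to 3 can be put together in a short dense sketch. This is an illustration only and not the authors' implementation: it stores every vector g^(j) in full and applies the rank-one terms in plain sequential order j = 1, ..., m - 1 rather than in the tree order of the loops above (the Sherman-Morrison recursion is valid for any order), so it does not exploit the sparsity discussed in the next section. The function name and the array conventions (0-based indices, `a[0]` and `c[N-1]` unused, N even) are assumptions.

```python
def recursive_decoupling_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-, main and super-diagonals a, b, c.

    Dense sketch: J is the 2x2 block diagonal part, the m-1 rank-one terms
    x^(j) y^(j)T are folded in one at a time with Sherman-Morrison updates.
    """
    n = len(b)
    m = n // 2
    e = list(b)                              # modified diagonal, eqs. (4)-(5)
    for j in range(1, m):
        e[2 * j - 1] = b[2 * j - 1] - a[2 * j]
        e[2 * j] = b[2 * j] - c[2 * j - 1]

    def solve_J(r):
        """Apply J^{-1} block by block, as in eqs. (13)-(14)."""
        out = [0.0] * n
        for blk in range(m):
            p, q = 2 * blk, 2 * blk + 1
            det = e[p] * e[q] - a[q] * c[p]
            out[p] = (e[q] * r[p] - c[p] * r[q]) / det
            out[q] = (-a[q] * r[p] + e[p] * r[q]) / det
        return out

    def y_dot(j, v):
        """y^(j)T v, using the two non-zero entries of y^(j) from eq. (6)."""
        return a[2 * j] * v[2 * j - 1] + c[2 * j - 1] * v[2 * j]

    u = solve_J(d)                           # step 1
    g = {}
    for j in range(1, m):                    # step 2
        x = [0.0] * n
        x[2 * j - 1] = x[2 * j] = 1.0
        g[j] = solve_J(x)
    for j in range(1, m):                    # step 3, sequential order
        gj = g.pop(j)
        alpha = 1.0 / (1.0 + y_dot(j, gj))
        s = alpha * y_dot(j, u)
        u = [ui - s * gi for ui, gi in zip(u, gj)]
        for i in g:
            s = alpha * y_dot(j, g[i])
            g[i] = [vi - s * gi for vi, gi in zip(g[i], gj)]
    return u


if __name__ == "__main__":
    N = 8
    a = [0.0] + [-1.0] * (N - 1)
    b = [2.0] * N
    c = [-1.0] * (N - 1) + [0.0]
    d = [1.0] + [0.0] * (N - 1)
    print(recursive_decoupling_solve(a, b, c, d))
```

The cost of this dense form grows quadratically with N; the point of the modifications described in the next section is precisely to avoid touching the zero entries of the vectors g^(j).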


3  The  Parallel  recursive  decoupling  algorithm 

In  this  section  we  propose  some  modifications  to  the  above  sequential  algorithm 
in  order  to  reduce  storage  and  execution  time.  Then,  we  propose  a  parallelization 
of  the  algorithm. 

Note that in step 2, when we calculate g^(j) (0 < j ≤ m - 1), the initial vectors
x^(j) only contain 2 non-zero elements. Therefore, at the 1st iteration the vectors
g^(j) are composed of 4 non-zero elements and, in general, at iteration k, g^(j) is
a vector with 2^(k+1) non-zero elements, namely components from 2j + 1 - 2^k to
2j + 2^k.

Observe in the example in Fig. 1 that to compute vector g^(i) we do not need
all the vectors in each iteration k. In fact, the only vectors g^(j) needed are those
which have non-zero elements exactly at row i, where column i of matrix
(I - α_j g^(j) y^(j)T) also has non-zero elements. It can be easily proved
that this happens if 2^(k+1) ⌊j / 2^(k+1)⌋ ≤ i ≤ j + 2^(k-1). Then, the internal loop i in
step 3 of the recursive procedure can be simplified as follows:

begin = 2^(k+1) ⌊j / 2^(k+1)⌋
if (begin == 0) then begin = 2^k
for i = begin, j + 2^(k-1), 2^k
    g^(i) = (I - α_j g^(j) y^(j)T) g^(i)
end


On the other hand, the fill-in process only occurs at several points of the
algorithm, where the values associated with specific g^(j) are computed. These
values are calculated using the recursive tree procedure described in step 3. For
example, in Fig. 2, at iteration 1, j takes the values {1, 3, 5, 7}, at iteration
2 it takes the values {2, 6} and at iteration 3 it takes the value {4}. In addition, the
vectors g^(1), g^(3), g^(5) and g^(7) are used only at the 1st iteration and during
the execution of the algorithm keep at most 4 non-zero elements. Similarly,
vectors g^(2) and g^(6) are used until the 2nd iteration and their number of non-zero
elements is at most 8, and so on. As a consequence, not all the vectors g^(j)
undergo the fill-in procedure in the same way. We take advantage of this fact to
gather the non-zero elements, thus saving memory. Instead of arrays of
size N/2 × N/2, we have arrays of size (n - 1) × N/2. Stage 2 in Fig. 1
shows how the vectors g^(j) are stored for the case N = 16.


Concerning the parallelization of the algorithm, Fig. 1 summarizes its
stages by means of an example (N = 16 equations on 4 PEs). In this algorithm
the responsibility for performing the computation of the initial steps is divided among
all the processors. The process of partitioning matrix A, given in (7),
as well as the distribution of vectors u and d, is referred to as the preliminary stage.
At this stage, communications of the $c_{(N \cdot i)/P - 1}$ occur from processor i to
processor i + 1 and, for the $a_{(N \cdot i)/P}$, from processor i + 1 to processor i, where
i = 1, ..., P - 1, P being the number of processors (see Fig. 1).

After the preliminary stage, steps 1 to 3 are computed. Bearing in mind the
block diagonal structure of matrix J, step 1 may be computed concurrently in
all the processors without any communication, since the m subsystems in (13)
can be solved in parallel. The same happens at stage 2, but in this case the
m - 1 subsystems in (15) are to be solved. Some vectors g^(j) are distributed
among two processors, but this does not imply any communication since each
processor calculates its components of the vector g^(j) using local data. As an
example, in Fig. 1, the components {2,3} of vector g^(2) are in processor 0 and
the elements {4,5} in processor 1. This distribution of vector g^(j) provides a
better load balance.


At stage 3 no communications are required during the first n - p - 1
iterations. However, the last p iterations require communications, since the i-th
element of vector u must be transferred to all the processors containing non-zero elements
of the i-th column of the matrix (I - α_j g^(j) y^(j)T).
In addition, the k-th element of vector g^(i) must be transferred to the processors
which contain non-zero elements of column k of (I - α_j g^(j) y^(j)T).
[Figure 1; panels: Preliminary stage, Stage 1, Stage 2, Stage 3]

Fig. 1. Scheme of the parallel algorithm for N = 16 equations and 4 processors. We
denote as x the non-zero elements in both vectors and matrices. Circles
indicate data to be transferred and arrows point to the destination processors. At stage 2,
the computation of g^(1) from x^(1) and J is summarized. At stage 3, the figure shows how
g^(2) for k = 1 and g^(4) for k = 2 are calculated.

Fig. 2. The vectors g^(j) calculated at each iteration of step 3 for N = 16 equations.

4  Evaluation 

The recursive decoupling algorithm has been implemented on the Fujitsu AP3000
distributed memory computer [13] using the message passing programming model.
We have used the MPI programming environment. To verify the performance
of the parallel algorithm, we used a test tridiagonal system (with known solution),
whose coefficient matrix satisfies the condition $|b_i| > |a_i| + |c_i|$, $\forall i = 0, 1, \ldots, N-1$.
This test is described below,

$$\begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 2
\end{pmatrix}
\begin{pmatrix} u_0 \\ u_1 \\ \vdots \\ u_{N-2} \\ u_{N-1} \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix} \qquad (16)$$

whose exact solution is an N-dimensional vector u with components:

$$u_i = \frac{N + 1 - i}{N + 1}, \qquad \forall i = 1, \ldots, N. \qquad (17)$$

The experiments were performed on matrices of size ranging from 16384 (2^14)
to 1048576 (2^20) for the test (16). As we can see in Table 1, increasing the number
of processors reduces the execution time of the algorithm. We
observe that this method presents a high efficiency for all the system sizes.

Fig. 3 depicts the experimental results. Fig. 3.a shows the efficiency
of the modified sequential algorithm we propose relative to the efficiency of the initial
algorithm. Observe that performance increases by more than 91% for any
value of N. On the other hand, Fig. 3.b shows the efficiency of the parallel
algorithm for some values of parameter N. Efficiency was calculated using
the execution time of the sequential code. The parallel algorithm exceeds the
ideal speedup due to an efficient use of local memories and the communication
optimization. Therefore, these results show that the techniques employed to
parallelize the algorithm yield good performance on distributed
memory computers. A last observation is that our parallel program is scalable.
That is, in order to maintain a constant efficiency, N must grow at the same rate as
P, as observed in Fig. 3.b.

VECPAR '2000 - 4th International Meeting on Vector and Parallel Processing

Table 1. Execution times in seconds measured on the AP3000 for different numbers
of processors. The sizes of the matrices range from 16384 (2^14) to 1048576 (2^20).

P    2^15     2^16     2^17     2^18     2^19     2^20
1    0.4231   0.9098   1.9969   4.3576   9.2092   19.3657
2    0.1830   0.3561   0.7990   1.7162   3.9248   7.7731
4    0.0848   0.1866   0.3813   0.8379   1.9117   3.9418
8    0.0427   0.0897   0.1897   0.3988   0.9471   1.9001
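The speedup and efficiency figures discussed above can be recomputed directly from the execution times of Table 1. The sketch below is illustrative: the dictionary layout and function name are hypothetical, and efficiency is taken as T(1) / (P · T(P)), which is the usual definition and is assumed to match the one used for Fig. 3.b.

```python
# Execution times in seconds from Table 1; columns are N = 2^15 ... 2^20.
EXPONENTS = [15, 16, 17, 18, 19, 20]
TIMES = {
    1: [0.4231, 0.9098, 1.9969, 4.3576, 9.2092, 19.3657],
    2: [0.1830, 0.3561, 0.7990, 1.7162, 3.9248, 7.7731],
    4: [0.0848, 0.1866, 0.3813, 0.8379, 1.9117, 3.9418],
    8: [0.0427, 0.0897, 0.1897, 0.3988, 0.9471, 1.9001],
}


def efficiency_table(times):
    """Efficiency T(1) / (P * T(P)) for every processor count and size."""
    base = times[1]
    return {P: [t1 / (P * tP) for t1, tP in zip(base, row)]
            for P, row in times.items()}


if __name__ == "__main__":
    eff = efficiency_table(TIMES)
    for P in sorted(eff):
        cells = "  ".join(f"{v:.3f}" for v in eff[P])
        print(f"P={P}: {cells}")
```

Values above 1 correspond to the superlinear speedup mentioned in the text.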


[Figure 3; horizontal axes: (a) n, (b) number of processors]

Fig. 3. (a) Efficiency of the modified sequential algorithm we propose relative to the
efficiency of the initial algorithm. (b) Efficiency of the parallel algorithm on the AP3000 for
N = 2^15 and N = 2^20.


It is difficult to make a comparison with other implementations of the
Recursive Decoupling Method for solving tridiagonal systems on other machines, but the
speedup may be compared with those presented in [16, 3]. The numerical results in [16]
were obtained on the Balance 8000 multiprocessor system. The maximum speedup
is 2.1075 with N = 512 and P = 8. Climent et al. [3] present theoretically predicted
times for their algorithm on a Cray T3D. According to the efficiency results we
can conclude that our algorithm presents significantly better performance.

5  Conclusions 

In this paper, we have proposed a parallelization of the recursive decoupling
method for solving tridiagonal linear systems on distributed memory computers.
The method showed an optimization of the memory requirements, a superlinear
speedup and scalability. The memory savings come from a compressed storage
policy which eliminates the null elements. On the other hand, we study the fill-in
in the algorithm to optimize the execution of the scalar algorithm. This way, the
performance increases by more than 91% for any value of N.
Abstract. We describe the use of BLAS kernels as a key to efficient
vectorization of m-th order linear recurrence systems with constant
coefficients. Applying the Hockney-Jesshope model of vector computation,
we present the performance analysis of the algorithm, which also considers
the influence of memory bank conflicts. The theoretical analysis is
supported by experimental data collected on two Cray vector computers.

Keywords. m-th order linear recurrence systems, BLAS, LAPACK,
vectorization, memory bank conflicts, speedup.

Conference  topics.  Numerical  methods,  Parallel  and  distributed  algo¬ 
rithms. 


1  Introduction 

The critical part of several numerical algorithms reduces to the solution of a
linear recurrence system of order m for n equations with constant coefficients
[13, 16]:

$$x_k = \begin{cases} 0 & \text{for } k \le 0 \\ f_k + \sum_{j=1}^{m} a_j x_{k-j} & \text{for } 1 \le k \le n. \end{cases} \qquad (1)$$
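For reference, recurrence (1) can be evaluated by a direct scalar loop. The sketch below is only a baseline illustration (the function name is hypothetical and indices are 0-based, so `f[0]` holds $f_1$ and `a[0]` holds $a_1$); each $x_k$ depends on the previous m values, which is exactly what prevents straightforward vectorization.

```python
def solve_recurrence(a, f):
    """Evaluate x_k = f_k + sum_{j=1..m} a_j * x_{k-j}, with x_k = 0 for k <= 0."""
    m = len(a)
    x = []
    for k, fk in enumerate(f):
        s = fk
        # Only the already computed values x_{k-1}, ..., x_{k-m} contribute.
        for j in range(1, min(m, k) + 1):
            s += a[j - 1] * x[k - j]
        x.append(s)
    return x


if __name__ == "__main__":
    # m = 2 with a_1 = a_2 = 1 and f = (1, 0, 0, ...) gives Fibonacci numbers.
    print(solve_recurrence([1.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0]))
```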

The  efficient  solution  to  this  problem  is  of  particular  interest  in  case  of  vector 
computers  as  optimizing  compilers  are  not  able  to  generate  machine  code  that 
would  fully  utilize  the  underlying  hardware.  As  our  experiments  show,  even 
Cray’s  Fortran  compiler,  usually  recognized  as  the  best  vectorizing  compiler  on 
the  market,  is  in  this  category  (see  Section  5).  In  addition,  numerical  libraries 
(like  LAPACK  [1],  implemented  in  the  Cray's  scilib  library)  instead  of  problem 
(1)  provide  a  solution  to  a  more  general  problem: 

$$x_k = \begin{cases} 0 & \text{for } k \le 0 \\ f_k + \sum_{j=k-m}^{k-1} a_{kj}\, x_j & \text{for } 1 \le k \le n. \end{cases} \qquad (2)$$

Solution to this problem requires more memory and, in the case of LAPACK
routines, the computational efficiency is obtained primarily by solving it for
multiple right hand sides. When the original problem (1) is solved, a
simple application of a LAPACK routine does not achieve maximum
performance (see Section 5). The aim of our work is thus to find the
performance-optimal solver for the original problem (1). Based on our earlier work [9, 10,
11, 14] we have decided to approach the problem by augmenting the
divide-and-conquer approach proposed there by application of BLAS kernels. We then
proceeded to establish the optimal parameters to obtain maximum efficiency and
to eliminate memory bank conflicts.

FEUP - Faculdade de Engenharia da Universidade do Porto

We proceed as follows. In the next section we introduce the algorithmic
framework used in our work. We follow with the description of implementation details
of the proposed algorithm. We then sketch the theoretical analysis of
computational complexity. We complete our report by describing and analyzing the results
of our experiments performed on the Cray C-90 and SV-1.


2  Algorithm  description 

In our considerations we will assume that n ≫ m, i.e. the order of the recurrence
system is rather small. The idea of the algorithm is to rewrite (1) as the following
block system of linear equations


where  for  q  =  n/p  >  m  we  have 


Note  that  L  is  a  Toeplitz  matrix,  what  means  that  entries  are  constant  along 
each  diagonal.  The  system  (3)  corresponds  to  the  following  recurrence  system  ° 

f  xx  =  L~lil 

\Xj=  L-Hj  -  L~1Uxj-i  for  j  =  2,....p.  (5) 

To solve this system let us consider the structure of the matrix

$$U = -\sum_{k=1}^{m} \sum_{l=k}^{m} a_{m+k-l}\, e_k e_{q-m+l}^T, \qquad (6)$$

where $e_k$ denotes the k-th unit vector of $\mathbb{R}^q$. Obviously, equation
(5) reduces to the form

$$\begin{cases} \mathbf{x}_1 = L^{-1}\mathbf{f}_1, \\ \mathbf{x}_j = L^{-1}\mathbf{f}_j + \sum_{k=1}^{m} \alpha_j^k\, \mathbf{y}_k & \text{for } j = 2, \ldots, p, \end{cases} \qquad (7)$$

where $L\mathbf{y}_k = e_k$ and $\alpha_j^k = \sum_{l=k}^{m} a_{m+k-l}\, x_{(j-1)q-m+l}$.
Note that to compute vectors $\mathbf{y}_k$ we need to find only the solution of
the system $L\mathbf{y}_1 = e_1$, namely $\mathbf{y}_1 = (1, y_2, \ldots, y_q)^T$.
We can now form vectors $\mathbf{y}_k$ as follows


$$\mathbf{y}_k = (\underbrace{0, \ldots, 0}_{k-1}, 1, y_2, \ldots, y_{q-k+1})^T. \qquad (8)$$

This yields that the number of subsystems we must solve does not depend on
the order of the system. To find vectors $\mathbf{z}_j = L^{-1}\mathbf{f}_j$ and
$\mathbf{y}_1$ we must solve p + 1 recurrence systems of order m for q equations.
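
The scheme of this section can be sketched in a few lines of Python. This is only an illustration under the assumptions n = pq and q > m; the names solve_direct and solve_blocked are ours (they do not come from the authors' Fortran code) and plain lists stand in for the arrays. Each block is first solved independently with the scalar recurrence (giving the vectors z_j and y_1), and the couplings between blocks are then restored with the coefficients alpha and the shifted copies of y_1, as in (7) and (8).

```python
def solve_direct(a, f):
    """Scalar reference for (1): x_k = f_k + sum_{j=1..m} a_j x_{k-j}, x_k = 0 for k <= 0."""
    m, n = len(a), len(f)
    x = [0.0] * n
    for k in range(n):                       # k is 0-based here
        s = f[k]
        for j in range(1, min(m, k) + 1):
            s += a[j - 1] * x[k - j]
        x[k] = s
    return x


def solve_blocked(a, f, p, q):
    """Divide-and-conquer variant; assumes len(f) == p*q and q > m (illustrative only)."""
    m, n = len(a), len(f)
    assert n == p * q and q > m
    # z_j = L^{-1} f_j and y_1 = L^{-1} e_1: p + 1 independent recurrences of length q
    z = [solve_direct(a, f[j * q:(j + 1) * q]) for j in range(p)]
    y1 = solve_direct(a, [1.0] + [0.0] * (q - 1))
    x = z[0][:]                              # x_1 = z_1
    for j in range(1, p):
        prev = x[(j - 1) * q:j * q]          # previous block x_{j-1}
        xj = z[j][:]
        for k in range(1, m + 1):
            # alpha_j^k = sum_{l=k..m} a_{m+k-l} * (entry q-m+l of the previous block)
            alpha = sum(a[m + k - l - 1] * prev[q - m + l - 1] for l in range(k, m + 1))
            # y_k is y_1 shifted down by k-1 positions, see (8)
            for i in range(k - 1, q):
                xj[i] += alpha * y1[i - k + 1]
        x.extend(xj)
    return x
```

For example, solve_blocked([2.0], [1.0] * 4, 2, 2) reproduces the result of the scalar recurrence x_k = 1 + 2 x_{k-1} on four points.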


3  Implementation  details 

Now let us consider the possible implementations of the proposed algorithm. We
can omit the assumption that n = pq because after we choose integers p and q
we can apply (7) to find $x_1, \ldots, x_{pq}$ and (1) to find
$x_{pq+1}, \ldots, x_n$. First we have to find vectors $\mathbf{z}_j$ and
$\mathbf{y}_1$. We can do it efficiently by using a sequence of _AXPY
operations $y \leftarrow y + \alpha x$. Note that _AXPY consists of 2N floating
point operations and it can be computed in a simple loop of length N. So let us
define matrices

$$Z = (\mathbf{z}_1, \ldots, \mathbf{z}_p, \mathbf{y}_1), \qquad F = (\mathbf{f}_1, \ldots, \mathbf{f}_p, e_1) \in \mathbb{R}^{q \times (p+1)}$$

and denote by $Z_{k,*}$ the k-th row of Z. Now we can find the solution of the
system LZ = F using the formula

$$Z_{k,*} = \begin{cases} 0 & \text{for } k \le 0, \\ F_{k,*} + \sum_{j=1}^{m} a_j Z_{k-j,*} & \text{for } 1 \le k \le q. \end{cases} \qquad (9)$$

Initially columns of the matrix F can be stored in a one-dimensional array x, so
Z can be computed using the following code

      do k=1,q
         do j=1,min(m,k-1)
!           row k of Z: Z(k,*) = Z(k,*) + a(j)*Z(k-j,*); rows have stride q in x
            call saxpy(p+1,a(j),x(k-j),q,x(k),q)
         end do
      end do

It can be easily calculated that the number of _AXPY operations is equal to
$m(q - \frac{m+1}{2})$ and thus the total number of operations needed to find
vectors $\mathbf{z}_j$ and $\mathbf{y}_1$ can be expressed as

$$C_1 = 2(p+1)\, m \left(q - \frac{m+1}{2}\right). \qquad (10)$$

As soon as the matrix Z is calculated its last column ought to be copied to a
new array y such that y(-m:0) = 0.0.

      call scopy(q,x(p*q+1),1,y(1),1)
      do j=-m,0
         y(j)=0.0
      end do
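
The operation count behind (10) can be checked by simply replaying the double loop that computes Z. The short Python sketch below does this; the function names are ours, and it only counts calls, it does not model the BLAS routine itself.

```python
def count_axpy_calls(q, m):
    """Number of _AXPY calls issued by the double loop that computes Z."""
    calls = 0
    for k in range(1, q + 1):
        for _j in range(1, min(m, k - 1) + 1):
            calls += 1
    return calls


def c1_flops(p, q, m):
    """Flops for finding Z: every call is an _AXPY of length p + 1, i.e. 2(p + 1) flops."""
    return 2 * (p + 1) * count_axpy_calls(q, m)


def axpy_calls_closed_form(q, m):
    """Closed form m(q - (m + 1)/2) in integer arithmetic; valid for q >= m."""
    return m * (2 * q - m - 1) // 2
```

For q >= m the loop count and the closed form agree, which is the statement used to obtain (10).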

Now vectors $\mathbf{x}_j$, j = 2, ..., p, can be computed. For each vector we
should compute coefficients $\alpha_j^k$ using the following code

!     alpha(1:m) is assumed to be set to zero before this loop
      do k=1,m
         call saxpy(m+1-k,a(m+1-k),x(q*(j-1)-m+k),1,alpha,1)
      end do

and then find $\mathbf{x}_j$ using a sequence of _AXPY calls

      do k=1,m
         call saxpy(q,alpha(k),y(2-k),1,x(q*(j-1)+1),1)
      end do

The total number of floating-point operations in this part of the algorithm is

$$C_2 = (p-1)\, m\, (2q + m + 1). \qquad (11)$$


Now let us consider possible modifications of the proposed algorithm. First,
observe that the last step of the algorithm can be implemented in terms of level
2 BLAS using one call of _GEMV. More precisely, when we form

$$W = (\mathbf{y}_1, \ldots, \mathbf{y}_m) \in \mathbb{R}^{q \times m} \qquad (12)$$

then instead of the last loop above, we can use

      call sgemv('N',q,m,1.0,w,ldw,alpha,1,1.0,x(q*(j-1)+1),1)

Note that the use of _GEMV requires additional space for qm entries of W.

Let us now observe that for finding Z we can consider the use of the routine
_TBTRS from the LAPACK library [1] which solves a system AX = B where A
is a triangular banded matrix. Thus instead of the sequence of _AXPY calls based
on (9) we would have the following LAPACK call

      call stbtrs('L','N','U',q,m,p+1,ab,ldab,x,q,info)

We have to recall that this routine does not take into account the Toeplitz
structure of the matrix L and requires additional space for m + 1 diagonals of
L, i.e. for (m + 1)q additional values.

In the table below we summarize algorithms that can be used to solve the
original problem (1):

Algorithm      Description
------------   ---------------------------------------------------------
Scalar         Scalar code based on a direct implementation of (1)
Algorithm 1A   The main algorithm based on calls to the _AXPY routine
Algorithm 1B   As Algorithm 1A but the last step is calculated by one
               call of the level 2 BLAS routine _GEMV
Algorithm 2    The system LZ = F solved by a call to the LAPACK routine
               _TBTRS and the last step calculated by the call to _GEMV
Algorithm 3    LAPACK _TBTRS routine called for one RHS


4  Performance  analysis 


To study the performance of the algorithm let us consider the theoretical model
of vector computations introduced by Hockney and Jesshope [6, 2].

The performance $r_N$ of a loop of length N can be expressed in terms of
two parameters $r_\infty$ and $n_{1/2}$ which are specific for a kind of loop
and vector computer. The first parameter represents the performance in Mflops
for a very long loop, while the second is the loop length for which a
performance of about $r_\infty/2$ is achieved. Then

$$r_N = \frac{r_\infty}{n_{1/2}/N + 1} \ \text{Mflops}. \qquad (13)$$

This yields that the execution time of _AXPY is

$$T_{axpy}(N) = \frac{2N}{10^6\, r_N} = \frac{2 \cdot 10^{-6}}{r_\infty}\,(n_{1/2} + N) \ \text{seconds}. \qquad (14)$$
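
As a small numerical illustration of (13) and (14), the model can be written down directly. The Python functions below are a sketch with names of our own choosing; the values of r_inf and n_half used in any call are hypothetical inputs, not measurements from the paper.

```python
def hockney_rate(n_loop, r_inf, n_half):
    """Performance r_N (in Mflops) of a loop of length n_loop, equation (13)."""
    return r_inf / (n_half / n_loop + 1.0)


def axpy_time(n_loop, r_inf, n_half):
    """Estimated time in seconds of one _AXPY of length n_loop, equation (14).

    An _AXPY performs 2*n_loop flops and r_N Mflops means 1e6 * r_N flops per second.
    """
    return 2.0 * n_loop / (1.0e6 * hockney_rate(n_loop, r_inf, n_half))
```

With the hypothetical values r_inf = 100 and n_half = 50, a loop of length 50 runs at half of the asymptotic rate, which is exactly how n_half is defined.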

From (10), (11) and (14) we get that the total execution time of our algorithm
can be estimated as follows

$$T(p, q) = \frac{2 \cdot 10^{-6}}{r_\infty}\, m \left(2pq + 2n_{1/2}\,p + n_{1/2}\,q - 2.5\,n_{1/2} - 0.5\,m\,n_{1/2} - m - 1\right),$$

where n = pq. It can be easily verified that T(p, q) reaches its minimum at the
point

$$(p, q) = \left(\sqrt{n/2},\ \sqrt{2n}\right). \qquad (15)$$

Thus the optimal choice of p and q depends only on the problem size n and
because these numbers should be integers we choose
$q = \lfloor\sqrt{2n}\rfloor$ and $p = \lfloor n/q \rfloor$.
Here, the last n - pq elements of the solution x can be computed by a scalar
algorithm based on (1).

Sometimes these chosen parameters have to be adjusted to avoid memory
bank conflicts. Vector computers usually store data so that contiguous words
(e.g. elements of arrays) are in separate memory banks. Usually the number of
banks in the memory system is a power of two. Memory bank conflicts may
occur when an array's stride (the difference in the index between two successive
iterations) is a multiple of a power of two. Then the memory cannot be efficiently
used because the CPU must wait until a former memory request to the same bank
is completed. Thus to avoid memory bank conflicts the parameter q should be
chosen as follows

$$q = \begin{cases} \lfloor\sqrt{2n}\rfloor - 1 & \text{if } \lfloor\sqrt{2n}\rfloor \text{ is even}, \\ \lfloor\sqrt{2n}\rfloor & \text{otherwise}. \end{cases} \qquad (16)$$
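
The parameter choice described above is easy to state in code. The following Python sketch (function names are ours) picks an odd q next to the square root of 2n, sets p to the integer part of n/q and reports how many trailing elements are left for the scalar recurrence. The second function evaluates only the part of the model time T(p, q) that depends on p and q, with p = n/q treated as a real number, so that the location of the minimum can be checked by brute force.

```python
import math


def choose_parameters(n):
    """Return (p, q, rest): q odd and close to sqrt(2n), p = floor(n/q), rest = n - p*q."""
    q = math.isqrt(2 * n)          # floor(sqrt(2n)), computed exactly
    if q % 2 == 0:                 # an even q may give a stride that causes bank conflicts
        q -= 1
    p = n // q
    return p, q, n - p * q


def relative_model_time(n, q, n_half):
    """Terms of T(p, q) that depend on p and q, with p = n/q (constant factors dropped)."""
    p = n / q
    return 2.0 * p * q + 2.0 * n_half * p + n_half * q
```

For the two problem sizes used in the experiments, choose_parameters(64000) gives q = 357 and choose_parameters(1024000) gives q = 1431. The hypothetical value of n_half does not move the minimum of the model, which is consistent with the remark that the optimal q depends only on n.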

Finally let us calculate the number of floating point operations performed by
the algorithm. Adding $C_1$ and $C_2$ defined by (10) and (11), and the number
of flops required for finding the last n - pq entries of the solution we get

$$C_{n,m}(p, q) = C_1 + C_2 + m\left(n - pq - \frac{m+1}{2}\right) = 3mpq - \frac{5}{2}\,m(m+1) + mn. \qquad (17)$$

5  Results  of  experiments 

The method has been implemented in FORTRAN and tested on a single processor
of the Cray C-90 and SV-1 vector computers. We have used the optimized
versions of BLAS and LAPACK available in the scilib library. Each algorithm
was tested varying the problem sizes n and m and values of parameter q. CPU
time was measured using the second function and the presented results represent
the best values from multiple runs.

Figures 1 and 2 illustrate the dependency between the performance of Algorithm
1A and the value of parameter q for m = 1 and n = 64000 and n = 1024000
respectively. Results for both Crays are reported in Mflops.


[Figure: Algorithm 1A for n = 64000, m = 1; Cray SV-1 and Cray C-90]

Fig. 1. Performance of Algorithm 1A for various values of q

[Figure: Algorithm 1A for n = 1024000, m = 1; Cray SV-1 and Cray C-90]

Fig. 2. Performance of Algorithm 1A for various values of q


It was shown above (see Section 4) that the optimal value of the parameter q
depends only on the size of the problem n. Our experiments support this claim
and show that this result holds for both machines (the optimal value is the
same on both Crays) even though they have different characteristic parameters
$r_\infty$ and $n_{1/2}$. The experimental optimal value of q has been found to
be in close proximity of the theoretically predicted one (excluding values which
are powers of 2 for which the memory bank conflicts affect performance). Thus, in
computational practice, the theoretically predicted optimal value of q can be
used to implement the code.

Figures 3 and 4 depict the relationship between the performance (in Mflops)
and the size of the problem n and the order of the recurrence m (for these
experiments the theoretically predicted optimal value of q was used). In Figure
3 we report the results for n = 64000 and m = 1, 2, ..., 6 for both Crays and all
five algorithms. In Figure 4 we present similar results for n = 1024000.

First, let us observe that the qualitative behavior of the five algorithms is
the same for both machines and is independent of the problem size n.

For m = 1 Algorithm 1A is the most efficient. For Algorithms 2 and
3 a performance dip manifests itself for m = 2. Starting from m = 2, a further
increase in m results in a performance increase. Interestingly, for all values
of m, Algorithms 2 and 3, which utilize the LAPACK library routine _TBTRS, are
substantially less efficient than Algorithms 1A and 1B and only barely more
efficient than the Scalar code.

As m increases, Algorithm 1B outperforms Algorithm 1A. This can be explained
as an effect of the application of the level 2 BLAS matrix-vector
multiplication _GEMV.

Finally, note that the performance of the two Crays depends on the problem
size n. For small n the Cray C-90 matches the performance of the newer SV-1
(for m = 6 it even outperforms it slightly). The situation changes radically for
n = 1024000. Here, the Cray SV-1 is almost twice as fast as the Cray C-90.

[Figure: panels "Cray SV-1, n=64000" and "Cray C-90, n=64000"; vertical axis from 0.0E+00 to 4.0E+02]

Fig. 3. Performance of the algorithms for various m

[Figure: panels "Cray SV-1, n=1024000" and "Cray C-90, n=1024000"; vertical axis from 0.0E+00 to 4.0E+02]

Fig. 4. Performance of the algorithms for various m

We believe that from the point of view of the user one of the more interesting
parameters is the speedup of the "fancy" algorithms over the basic Scalar
approach. We illustrate this aspect of the problem in Figures 5 and 6. Here we
report the speedup as the function of the problem size n for both machines for
m = 1 and m = 4 respectively. As previously, the optimal theoretical value of q
was used for Algorithms 1A, 1B and 2.

[Figure: speedup versus problem size n]

Fig. 5. Speedup of the algorithms for various n

[Figure: speedup versus problem size n]

Fig. 6. Speedup of the algorithms for various n


As previously, the results are qualitatively similar for both machines. In all
cases (independently of n) Algorithms 2 and 3 do not result in a significant
speedup over the Scalar approach. Interestingly, while as n increases (for a
fixed m) the speedup of Algorithms 1A and 1B over Scalar increases, as m
increases (for a given n) the speedup decreases. This indicates that the code
generated by the compiler from the Scalar algorithm becomes more efficient as m
increases.

Finally let us summarize the results of experiments:

-  Algorithms 1A and 1B achieve the best performance for values of the
parameter q close to the theoretical optimal value. The optimal choice of q
depends only on the problem size (and memory bank conflicts).

-  The use of Algorithm 1B instead of 1A is profitable when m > 2. This is
caused by the use of the level 2 BLAS routine _GEMV. However, use of _GEMV
requires additional space for qm entries of W.

-  The speedup of Algorithms 1A and 1B over the Scalar code increases when
the problem size n increases and decreases when the order of the system m
increases.

-  The Mflops performance increases when the problem size n increases as well
as when the order of the system m increases.

-  When $q = a \cdot 2^k$ (for integers a, k), performance rapidly decreases.
Increase in the value of k results in further substantial performance
degradation. This is the effect of memory bank conflicts.

-  The performance of Algorithms 2 and 3 is rather poor and the algorithms
require additional space. This is a result of the fact that the _TBTRS routine
from LAPACK solves the more general problem (2) and does not utilize the
special Toeplitz structure of the matrix L.

-  For first order linear recurrences (m = 1) Algorithm 1A is approximately
six times faster than Algorithms 2 and 3, which use the _TBTRS routine
from LAPACK, and for large n it achieves a speedup of up to 60 against the
Scalar algorithm based on (1).


Abstract. This paper analyzes the parallel performance of a numerical solver for
discrete-time periodic Riccati equations. The approach performs a sequence of
orthogonal reordering transformations of the monodromy matrices associated with
the equations, and then employs the so-called matrix disk function to solve a series
of discrete-time algebraic Riccati equations. The experimental results report the
performance of the parallel algorithms on a cluster of Intel Pentium-II processors.


1  Introduction 

Consider the discrete-time linear systems

$$x_{k+1} = A_k x_k + B_k u_k, \quad x_0 = x^0, \qquad y_k = C_k x_k, \qquad (1)$$

k = 0, 1, ..., where $A_k \in \mathbb{R}^{n \times n}$, $B_k \in \mathbb{R}^{n \times m}$,
and $C_k \in \mathbb{R}^{r \times n}$. Discrete-time periodic systems satisfy
$A_{k+p} = A_k$, $B_{k+p} = B_k$, $C_{k+p} = C_k$, for some integer
period p. The analysis and design of these systems has received considerable
attention in recent years (see, e.g., [7,9,19,20,23]).

An important application in control theory is the linear-quadratic optimal
control problem. The solution of this problem is intrinsically related to
the unique periodic symmetric positive semidefinite solution,
$X_k = X_{k+p} \in \mathbb{R}^{n \times n}$, of the discrete-time periodic Riccati
equation (DPRE)

$$0 = C_k^T Q_k C_k - X_k + A_k^T X_{k+1} A_k - A_k^T X_{k+1} B_k \left(R_k + B_k^T X_{k+1} B_k\right)^{-1} B_k^T X_{k+1} A_k. \qquad (2)$$

Here, $Q_k = Q_{k+p} \in \mathbb{R}^{r \times r}$ is a positive semidefinite matrix
of weights for the outputs, and $R_k = R_{k+p} \in \mathbb{R}^{m \times m}$ is a
positive definite matrix of weights for the inputs (see [7] for details). In case
p = 1, the DPRE in (2) reduces to the well-known discrete-time algebraic Riccati
equation (DARE) [15]. Traditional DARE solvers are described, e.g., in [14-16,18].

*  Supported by the Conselleria de Cultura, Educacion y Ciencia de la Generalidad
Valenciana GV99-59-1-14 and the Fundacio Caixa-Castello Bancaixa.

Consider now the periodic symplectic matrix pencil, associated with the
DPRE (2),

$$L_k - \lambda M_k = \begin{pmatrix} A_k & 0 \\ -C_k^T Q_k C_k & I_n \end{pmatrix} - \lambda \begin{pmatrix} I_n & B_k R_k^{-1} B_k^T \\ 0 & A_k^T \end{pmatrix} \equiv L_{k+p} - \lambda M_{k+p}, \qquad (3)$$

where $I_n$ denotes the identity matrix of order n. In case $A_k$ is invertible, it
is possible to construct the periodic monodromy matrix [9],

$$\Pi_k = M_{k+p-1}^{-1} L_{k+p-1} \cdots M_k^{-1} L_k, \qquad \Pi_k = \Pi_{k+p}, \qquad (4)$$

and the solution of the DPRE can then be obtained by a spectral division
technique [10,13,17]. Unfortunately, this is not a practical approach as a
considerable loss of accuracy can be expected in case any of the inverses in (4)
is ill-conditioned [11].

The Schur vectors method [2,12,15] was successfully extended in [9,11]
for solving DPRE, without explicitly forming the corresponding monodromy
matrices. In this method, a periodic Schur form of the monodromy matrix
is computed with a special ordering of the eigenvalues. However, parallel
implementations of this type of algorithm (e.g., the QR/QZ algorithms) show
poor scalability and an efficiency far from those of traditional matrix
factorizations such as, e.g., LU decomposition [8].

In this paper we follow a different approach, described in [5], for the
solution of DPRE. The algorithm employs a reliable swapping of the matrix
products in (4) to transform the DPRE to p DAREs. We then employ the
matrix disk function to obtain the corresponding solutions [4].

In sections 2 and 3 we briefly review, respectively, the "swapping" method
for solving DPRE and the matrix disk function for solving DARE. In section 4
we describe the parallel implementations of the algorithms. Our medium-grain
parallel approach requires efficient parallel implementations of two numerical
kernels provided, e.g., in ScaLAPACK [8]. In section 5 we report the
performance of the parallel implementations on a cluster of Intel Pentium-II
processors, connected via a Myrinet switch crossbar network. Our concluding
remarks are presented in section 6.


2  The  Swapping  Method  for  the  DPRE 

In [5] an algorithm is described for solving DPRE without explicitly forming
the monodromy matrices. The approach relies on the use of the following
lemma.

Lemma. Consider the matrix pair (Z, Y), $Z, Y \in \mathbb{R}^{n \times n}$. If Y is
invertible and

$$\begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} \begin{pmatrix} Y \\ -Z \end{pmatrix} = \begin{pmatrix} R \\ 0 \end{pmatrix} \qquad (5)$$

is a QR factorization of $[Y^T, -Z^T]^T$, then $Q_{22}^{-1} Q_{21} = Z Y^{-1}$.

This lemma is applied in [5] to reorder the matrix products in (4). The
goal is to obtain a matrix product of the form

$$\Pi_k = \hat{M}_k^{-1} \hat{L}_k, \qquad (6)$$

without computing the inverses. The solutions of the DPRE are then computed
by solving the corresponding DAREs.

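Specifically, the method proceeds as follows. Consider p = 3, the monodromy
matrix

$$\Pi_0 = M_2^{-1} L_2 M_1^{-1} L_1 M_0^{-1} L_0, \qquad (7)$$

and apply the swapping to matrix pairs

$$(L_2, M_1), \quad (L_1, M_0), \quad \text{and} \quad (L_0, M_2). \qquad (8)$$

(Notice that the same matrix pairs also arise in $\Pi_1$ and $\Pi_2$.) Then, we obtain

$$(L_2^{(1)}, M_1^{(1)}), \quad (L_1^{(1)}, M_0^{(1)}), \quad \text{and} \quad (L_0^{(1)}, M_2^{(1)}), \qquad (9)$$

which satisfy

$$L_2 M_1^{-1} = (M_1^{(1)})^{-1} L_2^{(1)}, \quad L_1 M_0^{-1} = (M_0^{(1)})^{-1} L_1^{(1)}, \quad \text{and} \quad L_0 M_2^{-1} = (M_2^{(1)})^{-1} L_0^{(1)}. \qquad (10)$$

Therefore,

$$\Pi_0 = M_2^{-1} (M_1^{(1)})^{-1} L_2^{(1)} (M_0^{(1)})^{-1} L_1^{(1)} L_0. \qquad (11)$$

Repeating the swapping procedure with the matrix pair $(L_2^{(1)}, M_0^{(1)})$ we
obtain $(L_2^{(2)}, M_0^{(2)})$ such that
$L_2^{(1)} (M_0^{(1)})^{-1} = (M_0^{(2)})^{-1} L_2^{(2)}$ and the required reordering
for $\Pi_0$ is obtained:

$$\Pi_0 = \left(M_0^{(2)} M_1^{(1)} M_2\right)^{-1} \left(L_2^{(2)} L_1^{(1)} L_0\right) = \hat{M}_0^{-1} \hat{L}_0. \qquad (12)$$

Similar reorderings are obtained for $\Pi_1$ and $\Pi_2$.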
The algorithm can be stated as follows [5].

for k = 0, 1, ..., p - 1
    Set \hat{L}_k <- L_k,  \hat{M}_{(k+1) mod p} <- M_k
end
for t = 1, 2, ..., p - 1
    for k = 0, 1, ..., p - 1
        (L_{(k+t) mod p}, M_k) <- swap(L_{(k+t) mod p}, M_k)
        \hat{L}_k <- L_{(k+t) mod p} \hat{L}_k
        \hat{M}_{(k+t+1) mod p} <- M_k \hat{M}_{(k+t+1) mod p}
    end
end
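
The reordering loop can be tried out on small dense matrices. The Python sketch below is an illustration only and makes several assumptions of its own: the names swap, reorder and monodromy2 are ours; the rows [Q21 Q22] are obtained by Gram-Schmidt orthogonalisation against the columns of [M; -L] instead of a Householder QR factorization (any orthonormal basis of that orthogonal complement satisfies the lemma); and the check uses generic 2 x 2 matrices with a closed-form inverse rather than symplectic pencils.

```python
import math


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def swap(L, M):
    """Return (L1, M1) with L M^{-1} = M1^{-1} L1; M must be invertible."""
    n = len(M)
    # columns of C = [M; -L], followed by unit vectors used to complete the basis
    cands = [[M[i][j] for i in range(n)] + [-L[i][j] for i in range(n)] for j in range(n)]
    cands += [[1.0 if i == j else 0.0 for i in range(2 * n)] for j in range(2 * n)]
    basis = []
    for v in cands:
        w = v[:]
        for _ in range(2):                   # Gram-Schmidt with one re-orthogonalisation
            for b in basis:
                d = sum(x * y for x, y in zip(w, b))
                w = [x - d * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > 1e-8:
            basis.append([x / norm for x in w])
        if len(basis) == 2 * n:
            break
    rows = basis[n:]                         # orthonormal rows [Q21 Q22], orthogonal to C
    return [r[:n] for r in rows], [r[n:] for r in rows]


def reorder(Ls, Ms):
    """Return (L_hat, M_hat) such that Pi_k = M_hat[k]^{-1} L_hat[k] for every k."""
    p = len(Ls)
    Ls, Ms = list(Ls), list(Ms)              # work on copies of the lists
    L_hat = list(Ls)
    M_hat = [None] * p
    for k in range(p):
        M_hat[(k + 1) % p] = Ms[k]
    for t in range(1, p):
        for k in range(p):
            i = (k + t) % p
            Ls[i], Ms[k] = swap(Ls[i], Ms[k])
            L_hat[k] = matmul(Ls[i], L_hat[k])
            j = (k + t + 1) % p
            M_hat[j] = matmul(Ms[k], M_hat[j])
    return L_hat, M_hat


def inv2(A):
    """Inverse of a 2 x 2 matrix; used only to form the reference product."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def monodromy2(Ls, Ms, k):
    """Pi_k = M_{k+p-1}^{-1} L_{k+p-1} ... M_k^{-1} L_k for 2 x 2 matrices."""
    p = len(Ls)
    P = [[1.0, 0.0], [0.0, 1.0]]
    for i in range(k, k + p):
        P = matmul(matmul(inv2(Ms[i % p]), Ls[i % p]), P)
    return P
```

A way to check the result without inverting the reordered factors is to verify that M_hat[k] times monodromy2(Ls, Ms, k) equals L_hat[k] up to rounding.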

The matrix products $\Pi_k$ can still be formed in a formal way and reveal the
monodromy relation if (some of) the $M_k$'s are singular; see [6,22] for details.
The computational cost of the reordering algorithm is $34p(p-1)n^3/3$ flops
(floating-point arithmetic operations) and $O(pn^2)$ for workspace.


3  The Matrix Disk Function for the DARE

In [13], Malyshev introduced an "inverse free iteration" for computing the
right deflating subspace of a matrix pair. The method was refined and made
truly inverse free in [3], and was further improved in [21].

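Given a regular matrix pair $(A_0, B_0) = (A, B)$, with
$A, B \in \mathbb{R}^{n \times n}$, the inverse-free iteration generates the
sequence of matrix pairs

$$A_{k+1} = Q_{21} A_k, \qquad B_{k+1} = Q_{22} B_k, \qquad (13)$$

with

$$\begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix} \begin{pmatrix} B_k \\ -A_k \end{pmatrix} = \begin{pmatrix} R_k \\ 0 \end{pmatrix} \qquad (14)$$

a QR factorization of $[B_k^T, -A_k^T]^T$.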
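
To see what the iteration (13)-(14) does, it helps to look at the scalar case n = 1, where the QR factorization in (14) reduces to a single plane rotation. The Python sketch below is our own illustration of that special case, not the parallel routine used in the paper: for a pair (a, b) the rotation gives Q21 = a/r and Q22 = b/r with r = sqrt(a^2 + b^2), so every step squares the ratio a/b.

```python
import math


def inverse_free_step(a, b):
    """One step of (13)-(14) for scalars a, b (not both zero)."""
    r = math.hypot(a, b)
    # Q = (1/r) * [[b, -a], [a, b]] maps the column (b, -a)^T to (r, 0)^T,
    # hence Q21 = a / r and Q22 = b / r
    return (a / r) * a, (b / r) * b


def iterate(a, b, steps):
    for _ in range(steps):
        a, b = inverse_free_step(a, b)
    return a, b
```

Starting from a pair with |a/b| < 1 the first component tends to zero, while for |a/b| > 1 the quotient a/(a + b) tends to one; in the matrix case this separation of the eigenvalues inside and outside the unit circle is what the converged pair is used for.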

In case this iterative scheme is applied with the initial 2n x 2n matrix
pair $(A_0, B_0) = (\hat{L}_k, \hat{M}_k)$, the solution $X_k$ of the associated
DARE can be obtained from the converged matrix
$L_\infty = \lim_{j \to \infty} A_j$ as follows. Let

$$L_\infty = \begin{pmatrix} L_{11} & L_{12} \\ L_{21} & L_{22} \end{pmatrix} \qquad (15)$$

be a partition of $L_\infty$ into n x n blocks. Then, $X_k$ is the solution of the
full-rank linear least-squares problem

$$\begin{pmatrix} L_{12} \\ L_{22} \end{pmatrix} X_k = -\begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix}; \qquad (16)$$

see [5] for details.

The cost of solving a DARE using the inverse free iteration for the matrix
disk function is $13n^3$ flops per iteration of (13)-(14), $13n^3/3$ for the LLS
problem in (16), and $O(n^2)$ for workspace.


4  Parallel  Algorithms 

Two approaches are possible for parallelizing the previous DPRE solver on a
parallel distributed-memory system. First, in case p is large compared to the
number of nodes, a coarse-grain strategy can be employed. In this case each
swapping of a pair of matrices (i.e., each QR factorization) is performed on a
different node of the system, and each DARE is finally solved on a different
node. The communications can be arranged so that a ring topology is
sufficient (see [5] for details). This algorithm only requires tuned send/receive
communication routines, an efficient numerical kernel for the QR factorization,
like that, e.g., in [1], and a serial implementation of the inverse free
iteration for the matrix disk function.

Nevertheless, in case the number of nodes, $n_p$, is larger than the period
of the system (as can be expected in large multicomputers), in a coarse-grain
algorithm part of the nodes of the system would be idle. Thus, in such a
case it is more efficient to perform each swapping in parallel. This
medium-grain approach benefits from the existence of parallel linear algebra
libraries, such as ScaLAPACK [8], which implements, among others, a parallel
kernel for the QR factorization. By sometimes performing the swapping algorithm
on slightly larger matrices, of the form $[A^T, 0, B^T]^T$, we avoid the
redistribution of the matrices that would be necessary to combine different pairs
of matrices. After the swapping stage, the DAREs are solved using a parallel
ScaLAPACK-like implementation of the matrix disk function.


5  Experimental  Results 


All the experiments were performed on a cluster of Intel Pentium-II processors
connected via a Myrinet switch, using IEEE double precision floating-point
arithmetic ($\varepsilon \approx 2.2 \times 10^{-16}$). A BLAS implementation
specially tuned for this architecture was employed. Performance experiments with
routine DGEMM achieved 200 Mflops (millions of flops per second) on one processor.

Our first experiment reports the execution time of the parallel DPRE solver,
using $n_p$ = 4, 9, and 16 nodes. Specifically, in Figure 1(a)-(c) we show the
execution time of the parallel implementation of the swapping method (routine
PDGGSWP) for DPRE with periods p = 2, 4, and 10. In Figure 1(d), we report
the execution time of the DARE solver based on the inverse free iteration for
the matrix disk function (routine PDGGDSK), required in the final stage of the
algorithm.

Figure 2 analyzes the scalability of the parallel routines. For this purpose,
we report the Mflops rate per node for PDGGSWP and PDGGDSK with
$n/\sqrt{n_p}$ fixed at 450 and 750, respectively. The constant performance of
the Mflops rate shows the high scalability of both algorithms.


6  Concluding  Remarks 

We have investigated the performance of a parallel numerical solver for
discrete-time periodic Riccati equations. The algorithm performs a sequence of
orthogonal reordering transformations of the monodromy matrices associated
with the equations, and transforms the problem to a series of discrete-time
algebraic Riccati equations, which are then solved by using the matrix disk
function. Experimental results on a cluster of Intel Pentium-II processors
report a high performance and scalability of our parallel algorithms.


[Figure: execution time versus problem size (n); panels (a)-(c) PDGGSWP for the three periods p, panel (d) PDGGDSK]

Fig.  1.  Execution  times  of  the  parallel  routines.

[Figure: Mflops rate per node for PDGGSWP and PDGGDSK]

Fig.  2.  Scalability  of  the  parallel  routines.


Abstract  The  history  of  the  development  of  parallel  computation
methodology  is  closely  linked  with  the  development  of  techniques  for  the 
computer  processing  of  images.  In  the  early  60s,  research  in  high  energy 
particle  physics  began  to  generate  extremely  large  numbers  of  particle  track 
photographs  to  be  analysed  and  attempts  were  made  to  devise  automatic  or 
semiautomatic  systems  to  carry  out  the  analysis.  This  stimulated  the  search  for 
ways  to  build  computers  of  increasingly  higher  performance  since  the  size  of 
the  image  data  sets  exceeded  any  which  had  previously  been  processed.  At  the 
same  time,  interest  was  growing  in  exploring  the  structure  of  the  human  visual 
system  and  it  was  felt  intuitively  that  image  processing  computation  should  bear 
at  least  some  resemblance  to  its  human  analogue. 


This  review  paper  traces  the  simultaneous  progress  in  these  two  related  lines  of 
research  and  discusses  how  their  interaction  influenced  the  design  of  many 
parallel  processing  computers  and  their  associated  algorithms. 


1.  Thirty  years  ago 

Image  Processing  was  originally  regarded  as  a  subset  of  the  wider  field  of  Pattern 
Recognition  which  dealt  with  the  analysis  and  processing  of  patterns  in  sound  and 
other  signal  sources  such  as  ECG  and  EEG  as  well  as  images.  In  all  these  areas,  the 
research  was  mainly  application  driven.  A  three-day  meeting  in  London  in  1968, 
organised  by  the  Institution  of  Electrical  Engineers  and  entitled  ‘Conference  on 
Pattern  Recognition’,  comprised  37  papers.  Of  these,  approximately  one  third  were 
devoted  to  Optical  Character  Recognition  (OCR)  and  a  quarter  to  the  physiology  or 
psychology  of  human  vision;  the  remainder  was  distributed  more  or  less  equally 
between  studies  of  learning  algorithms,  speech  recognition  and  general  problems  in 
pattern  recognition.  At  this  early  stage,  although  it  was  realised  that  the  principal 
application,  OCR,  would  eventually  demand  much  higher  processing  power  than  was 
currently  available,  the  lack  of  effective  algorithms  meant  that  research  was  directed 
towards  how  to  recognise  images  rather  than  to  doing  so  at  economic  speeds. 

Even so, what was not realised was how difficult the task would be. There was a
quite unjustifiable optimism amongst researchers which could probably be excused by
the fact that everyone could observe in action (and, in fact, owned) a very effective
image processing system which was portable, low power, high resolution and able to
work in an unconstrained environment. Colour analysis, stereoscopy, time sequence
analysis, automatic compensation for high or low light conditions, rotation invariance,
[...] segment recognition, learning capability: the system coped with all these
difficult aspects. Unfortunately, it was considered that a combination of intuition and
[...] reveal how the human vision system was constructed
and that this knowledge could then be translated into an appropriate combination of
hardware and software. This would amount to more than a PhD project but certainly
should not take as long as ten years to complete.

In  this  optimistic  atmosphere,  there  were  two  factors  which  stimulated  an  interest  in 
faster  computation.  First,  it  seemed  likely  that  useful  algorithms  would  soon  be 
developed  and  that  computers  would  then  need  to  be  made  much  more  powerful  in 
order  to  achieve  acceptable  processing  rates.  Second,  the  progress  being  made  in 
designing  algorithms  was  poor,  at  least  in  part  due  to  the  inefficient  computing 
services currently available. For example, at University College London in the early
60s, a large mainframe machine (IBM 360) provided the central computing service.
Programs and even test images were entered via punched cards and then batch
processed.  Typically,  a  print-out  of  the  results,  using  overprinted  characters  to 
represent  image  intensities,  would  be  obtained  on  the  following  day;  any  small 
programming  error  (such  as  an  unwanted  comma)  added  a  further  day's  delay  to  the 
program  development  time.  In  this  virtually  non-interactive  environment,  thinking 
constructively  about  algorithm  design  was  almost  painful. 

Optimistic  or  not,  almost  all  who  were  engaged  in  image  processing  research  agreed 
that  faster  computers  would,  sooner  or  later,  need  to  be  developed  and  that  there 
would  be  an  immediate  advantage  if  computing  speeds  could  be  improved.  The 
important  question  was:  how  could  a  speed  gain  be  achieved? 


2.  Faster  computing 

From the outset, it was clear that there were only three ways to speed up computing.
They were (and still are):

a)  More  efficient  programming; 

b)  Use  of  faster  components; 

c)  Improved  system  hardware  architecture. 

With  large  data  sets  to  be  processed,  it  is  extremely  important  to  optimise  the  pieces 
of  code  in  the  so-called  inner  loops.  For  example,  if  the  intensity  of  every  pixel  in  an 
image  is  to  be  averaged  with  its  neighbours,  then  the  code  performing  the  averaging 
may  be  executed  a  million  times  in  a  typical  size  image.  Any  wasted  operations  in 
that  section  of  code  will  severely  affect  the  overall  efficiency  of  the  program.  It  goes 
without  saying  that  experienced  programmers  would  not  be  expected  to  make  this  sort 
of error. In general, it would be hoped that most of the gains which could be obtained
by  efficient  programming  would  normally  already  have  been  made. 

Speeding up computers by using faster components is a continuous process of
technological  development  which  is  largely  under  the  control  of  computer 
manufacturers.  In  the  period  we  are  discussing,  computing  component  technology 
moved  from  thermionic  valves,  through  transistors  to  integrated  circuits,  having 
already  progressed  from  mechanical  (gear  wheels)  and  electromechanical  (relay) 
computation.  In  the  last  phase,  integrated  circuits  have  also  undergone  massive 
improvements  in  level  of  integration  (numbers  of  components  per  unit  area)  and 
semiconductor  technology,  both  of  which  have  produced  enormous  speed  gains.  For 
the  typical  researcher,  access  to  the  best  available  circuit  components  has  usually  been 
a  matter  of  cost  since  all  new  devices  tend  to  be  prohibitively  expensive  when  first 
introduced. 

The  third  approach  is  to  redesign  the  computer  architecture.  The  underlying  structure 
of  all  computers  was  once  much  the  same:  there  was  a  store  for  instructions,  a  store 
for  data  and  a  processor  which  was  controlled  by  instructions  extracted  from  the 
program  store.  These  acted  on  data  from  the  data  store,  producing  a  result  which  was 
returned  to  the  data  store.  There  were  also  units  which  input  and  output  data  and 
programs.  A  master  controller  ensured  that  all  these  operations  were  correctly 
sequenced.  This  extreme  oversimplification  hides  all  the  ingenuity  which  went  into 
making  these  basic  operations  efficient  and  transparent  to  the  programmer. 

Starting  with  this  fundamentally  simple  architecture,  the  challenge  was  to  make 
changes  which  would  improve  performance  not  marginally  but  substantially,  ideally 
by  many  orders  of  magnitude.  This  was  the  impetus  behind  the  introduction  of 
Parallel  Processing. 


3.  The  Concept  of  Parallel  Processing 

"Many hands make light work" is a well known saying, but then so is "Too many cooks
spoil the broth". The fact is that increasing the size of the work force does not
necessarily  reduce  the  time  (or  cost)  for  completing  a  task.  The  introduction  of 
additional  labour  implies  a  degree  of  organisation  and  co-ordination  and  may  also 
require  the  task  to  be  split  up  into  manageable  portions.  The  overhead  for 
organisation  can  be  more  than  the  time  saved  and  the  task  may  not  respond  well  to 
division.  How  often  does  one  hear  the  comment:  "I  don't  think  you  can  help  me;  it 
will  be  quicker  if  I  do  it  myself!"? 

The  central  challenge  in  the  design  of  parallel  computers  is  to  assemble  many 
computers  (or  processors)  into  a  system  which  will  then  share  the  execution  of  a 
program  in  such  a  way  that  the  time  between  the  start  and  end  of  the  whole  process  is 
reduced.  Ideally,  if  N  computers  are  used  to  execute  a  program  then  the  execution 
time T_N should be (1/N)T_1, where T_1 is the time taken by a single computer to execute
the same program (suitably rewritten for a single computer). In practice, this ideal is
seldom achieved, the exception being in computers designed for specific algorithms.
A crude measure of efficiency of a parallel architecture is T_1/(N T_N), but, as will be
discussed in more detail later, this measure will depend on the program being
executed, both in relation to the task being performed and to the skill of the
programmer.
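
The crude efficiency measure can be made concrete with a short calculation. The
sketch below is illustrative only: the function names and the timings are assumptions
chosen for the example, with t1 standing for the single-computer time and tn for the
time taken when n computers share the program.

```python
def speedup(t1: float, tn: float) -> float:
    """Ratio of the single-computer time to the n-computer time."""
    return t1 / tn


def efficiency(t1: float, tn: float, n: int) -> float:
    """Crude efficiency measure t1 / (n * tn); 1.0 corresponds to the ideal case."""
    return t1 / (n * tn)


# Hypothetical timings: 120 s on one computer, 20 s when shared between eight.
t1, tn, n = 120.0, 20.0, 8
print(speedup(t1, tn))        # 6.0
print(efficiency(t1, tn, n))  # 0.75
```

In this made-up case eight computers give only a sixfold gain, so a quarter of the
added capacity is lost to organisation and co-ordination.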


4.  Classifying  Parallel  Architectures 


In  general,  a  parallel  computer  will  consist  of  an  assembly  of  simple  computers 
usually referred to as processing elements (PEs). Each PE may be extremely simple,
perhaps only capable of processing single bit data, but might alternatively be
complex,  such  as  a  PC.  There  will  usually  be  memory  assigned  to  each  PE  and  an 
interconnection  network,  both  for  transmitting  data  between  PEs  and  for  supplying 
instructions  to  the  PEs.  Some  systems  operate  under  the  control  of  one  master 
computer  whereas  others  assign  partial  or  even  total  autonomy  to  each  PE. 


In  the  past  three  decades,  much  has  been  written  about  the  many  different 
architectures  of  parallel  processing  computers  and  many  attempts  have  been  made  to 
devise  a  taxonomy  for  classifying  the  architectures  (e.g.,  see  [9]).  The  best  known 
attempt  was  by  M  J  Flynn  [8]  whose  classification  was  based  on  whether  the  data 
stream  was  single  or  multiple  and  on  whether  the  instruction  stream  was  single  or 
multiple.  Of  the  four  possible  classes,  the  one  that  most  aptly  fitted  a  representative 
group  of  parallel  processing  computers  (several  of  which  were  actually  constructed) 
was  the  SIMD  class:  an  array  of  simple  PEs  all  simultaneously  executing  the  same 
instruction  (Single  Instruction  stream),  but  each  operating  on  its  own  part  of  the  data 
(Multiple Data stream). However, despite the fact that the paper describing this
taxonomy has been quoted in the literature more than has any other on this topic, this
division of parallel processors into four classes is so crude as to be virtually useless.
Many parallel systems either do not fall convincingly into any of the classes or else
equally  well  fall  into  more  than  one.  Furthermore,  the  first  class  (Single  Instruction 
stream,  Single  Data  stream)  refers  to  serial  computing  so  can  hardly  be  treated  as  part 
of  the  taxonomy. 

It  is  therefore  not  unreasonable  to  ask  why  researchers  persist  in  attempting  to  devise 
classification  schemes.  There  are  probably  two  main  reasons: 

Divide and conquer: Computer scientists (and others) have experienced great
difficulty  in  understanding  the  underlying  principles  of  parallel  processing  systems 
and  it  can  be  a  help  if  the  structure  of  each  system  is  compared  with  one  of  several 
archetypes:  a  form  of  learning  by  analogy; 

Establishing design objectives: Parallel computer designers need to be clear what
their  strategy  will  be  when  designing  a  new  system.  It  can  be  a  useful  design 
discipline  to  encapsulate  a  strategy  by  naming  and  defining  the  broad  principles 
governing  each  particular  design. 

For  the  remainder  of  this  review,  classification  schemes  will  not  be  considered, 
especially  as  there  is  now  little  or  no  agreement  as  to  which  scheme  should  be 
adopted. 


5. Parallel Processing Fundamentals


5.1  Three  Level  Processing 

It  is  easy  to  state  in  imprecise  terms  what  is  required  of  any  parallel  processing 
system.  It  is  a  system  which,  by  employing  more  than  one  processor,  completes  a 
data  processing  task  faster  than  could  be  achieved  by  a  single  processor.  In  order  to 
investigate  parallel  architectures,  the  following  discussion  will  concentrate  on  the 
particular  problems  associated  with  image  processing.  Examining  the  problems  in 
detail,  certain  significant  factors  begin  to  emerge: 

Data type: Image data usually consist of large regular arrays of square picture
elements (pixels), each of which represents the local brightness and, possibly, colour
of the image. Typically, each pixel is assigned a 1-bit integer (black and white
so-called binary images), an 8-bit integer (grey-level images) or a 24-bit integer
(colour images). An image of approximately domestic television resolution (512 x 512
pixels) comprises rather more than one quarter of a million pixels. Very many image
processing operations involve replacing each pixel by a new pixel whose intensity is a
function of the intensities in a defined neighbourhood, for example, the 3 x 3 pixel
region surrounding each pixel. This implies that an image processing operation can
involve over 2.5 million basic operations (each requiring fetching data from memory,
computing a sum or product and then storing the result in memory). The need for fast
processing is self evident.
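
A neighbourhood operation of this kind can be sketched in a few lines. The example
below is an assumption-laden illustration rather than code from any system described
in this review: the function names are hypothetical, border pixels are simply copied,
and one basic operation is counted per neighbour fetched. Under that count a
512 x 512 image needs roughly 2.4 million operations, the same order as the figure
quoted in the text.

```python
def mean_filter_3x3(image):
    """Replace each interior pixel by the integer mean of its 3x3 neighbourhood.

    Border pixels are copied unchanged (an assumption made here for brevity).
    """
    rows, cols = len(image), len(image[0])
    result = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            total = 0
            for dr in (-1, 0, 1):        # inner loops: nine fetch-and-add steps
                for dc in (-1, 0, 1):
                    total += image[r + dr][c + dc]
            result[r][c] = total // 9
    return result


def neighbourhood_operations(width, height, window=3):
    """Count of fetch-and-accumulate steps for a window operation on every pixel."""
    return width * height * window * window


small = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
print(mean_filter_3x3(small)[1][1])        # 4  (36 // 9)
print(neighbourhood_operations(512, 512))  # 2359296
```

The two innermost loops are the inner loop in the sense used earlier in this review:
any wasted work there is repeated once for every pixel in the image.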

Computation type: It is clear that the highly repetitive nature of the elements
of the image processing computation might offer potential for structuring a computer
architecture so as to take advantage of the repetitiveness.

Unfortunately,  this  brief  analysis  of  image  processing  greatly  oversimplifies  the 
situation.  Conventionally,  the  complex  task  of  image  processing  is  divided  into  three 
stages  or  levels  [23]: 

a)  Low  level  processing  which  is  characterised  by  taking  in  one  or  more 
images,  processing  them  and  outputting  one  or  more  result  images.  In  general,  the 
dimensions  of  the  input  and  output  data  arrays  will  be  identical; 

b)  Intermediate  level  processing  in  which  the  input  data  will  be  one  or  more 
images  (input  from  the  low  level  processing  stage)  and  the  output  data  will  be  one  or 
more  dimensionally  smaller  data  sets,  such  as  lists  of  detected  object  features  and 
global  properties  of  the  image  (e.g.  average  intensity,  histograms,  contrast  range). 

c)  High  level  processing  which  attempts  to  extract  meaning  from  the 
intermediate  level  data  with  a  view  to  describing  and  analysing  the  input  image.  The 
output  data  might  be  as  small  as  a  single  word  or  sentence. 
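
The way the data shrink from level to level can be illustrated with a toy example.
The three functions below are hypothetical stand-ins, chosen only to show the shape of
the data at each stage: an image of unchanged dimensions at the low level, a short list
at the intermediate level and a single word at the high level.

```python
def low_level_threshold(image, level):
    """Low level: image in, image of identical dimensions out."""
    return [[1 if pixel >= level else 0 for pixel in row] for row in image]


def intermediate_histogram(image, bins):
    """Intermediate level: image in, a much smaller data set (a histogram) out."""
    counts = [0] * bins
    for row in image:
        for pixel in row:
            counts[pixel] += 1
    return counts


def high_level_describe(histogram):
    """High level: a one-word description derived from the intermediate data."""
    background, foreground = histogram
    return "object" if foreground > 0 else "empty"


grey = [[10, 12, 11],
        [13, 200, 12],
        [11, 10, 12]]
binary = low_level_threshold(grey, level=128)  # still a 3 x 3 image
hist = intermediate_histogram(binary, bins=2)  # [8, 1]
print(high_level_describe(hist))               # object
```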


5.2 Processor Arrays


As  was  discussed  earlier,  many  low  level  image  processing  tasks  can  be  broken  down 
into  identical  short  sequences  of  basic  operations,  each  centred  on  every  pixel  in  the 
image. An image architecture closely matching the apparent requirements of this
level would therefore be an array of very simple processors, each associated with a
single pixel and each accessing data only from its own local memory or from the
neighbouring set of pixels. The repetitive nature of the processes to be performed
would permit broadcasting a sequence of instructions to each simple processor (PE),
the instructions being then executed simultaneously by every PE. This is the classic
SIMD architecture.
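
The broadcast principle can be mimicked in software. The sketch below is a toy model
and not a description of any machine discussed here: the class PEArray and the two
instructions are invented for the illustration, and the nested loops merely simulate,
one PE at a time, what real hardware would do in a single step.

```python
class PEArray:
    """Toy model of a processor array: one PE per pixel, each with one accumulator."""

    def __init__(self, image):
        self.pixels = [row[:] for row in image]        # local memory: the pixel
        self.acc = [[0] * len(row) for row in image]   # local memory: accumulator

    def broadcast(self, instruction, *args):
        """Every PE executes the same instruction in the same step."""
        rows, cols = len(self.pixels), len(self.pixels[0])
        new_acc = [row[:] for row in self.acc]
        for r in range(rows):
            for c in range(cols):
                new_acc[r][c] = instruction(self, r, c, *args)
        self.acc = new_acc  # all PEs update together (lockstep)


def load_own(array, r, c):
    """Copy the PE's own pixel into its accumulator."""
    return array.pixels[r][c]


def add_neighbour(array, r, c, dr, dc):
    """Add the pixel held by the neighbour at offset (dr, dc); off-array reads give 0."""
    nr, nc = r + dr, c + dc
    inside = 0 <= nr < len(array.pixels) and 0 <= nc < len(array.pixels[0])
    return array.acc[r][c] + (array.pixels[nr][nc] if inside else 0)


array = PEArray([[1, 2], [3, 4]])
array.broadcast(load_own)
for offset in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # north, south, west, east
    array.broadcast(add_neighbour, *offset)
print(array.acc)  # [[6, 7], [8, 9]]
```

Five broadcast instructions suffice whatever the size of the array, which is the
attraction of the scheme for low level processing.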


Fig.  1  A  4x4  PE  array,  showing  the  interconnections  between  PEs  and  the  bus  distributing 
instructions in parallel to each PE

Apart  from  the  paths  taken  by  the  instructions,  all  communication  paths  in  the 
array are short (i.e. to nearest neighbours), provided that local memory is associated
with every PE. One other set of longer paths is needed to input or output data to the
memory array but these could be routed along the instruction highway. Fig. 1
illustrates the main features of a 4 x 4 PE array.

Architectures  of  this  type  would  appear  to  be  ideal  for  low  level  processing  but 
present  many  difficult  problems  in  software  design.  Nevertheless,  it  can  be  shown 
that  arrays  of  very  simple  PEs  are  theoretically  capable  of  performing  all  image 
processing operations (even including those classified as intermediate or high level,
although these might not be executed very efficiently).

One or more loosely coupled conventional processors can efficiently handle high
level processing. There is no general pattern to the type of operations to be performed
nor to the various types of input data set, and the fastest available high speed
workstation or even PC would usually offer the best solution. The same computer
would  probably  be  used  to  control  the  other  two  levels  of  the  composite  system. 

The  most  difficult  stage  to  implement  is  the  intermediate  level.  By  definition,  the 
input  data  impose  requirements  similar  to  those  for  the  low  level  but  the  need  to 
abstract  information  derived  from  all  parts  of  the  image  (or  images)  implies  the  need 
for  efficient  connection  paths  across  the  whole  of  the  image  array.  It  would  also  seem 
likely  that  an  array  of  simple  PEs  would  not  represent  an  ideal  structure  for 
computing  histograms  and  other  results  contained  in  comparatively  small  data  sets. 
Optimisation  is  therefore  difficult  and  likely  to  be  specific  task  dependent. 

A  further  problem  resulting  from  the  splitting  of  the  low  and  intermediate  levels  is  the 
difficulty  in  transferring  the  multiple  image  data  between  the  two  levels.  Unless  this 
can  be  achieved  using  many  parallel  paths,  ideally  one  for  each  pixel,  then  this 
process  might  prove  to  be  the  bottleneck  for  the  whole  system. 

Taking  these  two  factors  into  consideration,  there  would  seem  to  be  good  arguments 
for  recombining  the  low  and  intermediate  levels,  enhancing  the  low  level  structure  by 
adding  good  communication  paths  between  all  parts  of  the  array  of  PEs. 

In  summary,  the  final  assembly  would  comprise  just  two  levels:  the  low/intermediate 
level  would  be  an  array  of  PEs,  one  per  pixel  for  the  size  of  image  to  be  processed, 
and  the  high  level/controller  would  be  a  conventional  workstation  or  high 
performance  PC. 


5.3  Pipeline  Processors 

In  the  discussion  in  the  previous  section  it  was  tacitly  assumed  that  the  task  presented 
was  to  process  a  single  image.  Parallelism  was  achieved  by  assigning  PEs  to  each 
part  of  the  image  data  (i.e.  to  each  pixel).  An  alternative  approach  can  be  adopted 
when  many  images  are  to  be  processed  in  a  sequence.  Under  these  circumstances, 
each  processor  is  given  a  particular  operation  to  perform  and  the  sequence  of  images 
is fed through a string of processors, the output of one providing the input for the
next.  The  processors  thus  constitute  a  pipeline  and  the  parallelism  is  now  function 
parallelism  rather  than  data  parallelism  (as  was  employed  in  the  processor  array). 
Sternberg  has  built  and  marketed  several  pipeline  processors  (named  Cytocomputers) 
and  developed  complex  software  to  program  them  [22]. 


[Figure 2: PEs labelled Function 1, Function 2, Function 3 and Function 4]

Fig. 2. A short pipeline processor with 4 PEs and a master controller


It is interesting to note that this type of computer might also be classified
as SIMD in that each PE executes a single instruction on multiple data, although in
this  case  the  data  is  multiple  in  time  rather  than  position.  In  that  the  Flynn  system  of 
classification  appears  not  to  distinguish  between  these  two  very  different 
architectures,  it  would  seem  to  be  of  little  practical  use. 

Because  the  operations  each  PE  performs  on  the  image  as  it  passes  through  it  can  be 
quite  complex,  a  pipeline  PE  will  usually  be  much  more  powerful  than  those  utilised 
in  processor  arrays.  A  further  consideration  is  that  cost  and  program  structure 
combine  to  make  it  unprofitable  to  construct  very  long  pipelines;  instead,  it  is  more 
efficient to cycle each stream of images several times through the pipeline,
reprogramming the PEs to perform new operations after each pass. Whether or not
this is done, there is always the disadvantage that the so-called latency of the pipeline
(the time delay between an image entering the first PE in the chain and the time it
leaves the last PE) may be inconveniently long. For example, although a 100 PE
pipeline might output fully processed images at a rate of 10 per second, the latency in
the chain would be 10 seconds, thus ruling out such a system for real-time processing
as might be required in a visually controlled machine.
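
The relationship between throughput and latency can be checked with a short
calculation. The sketch assumes, for simplicity, that every PE in the chain takes
exactly one output period per image; the function names are hypothetical.

```python
def pipeline_latency(stages: int, images_per_second: float) -> float:
    """Seconds between an image entering the first PE and leaving the last.

    Assumes every stage takes one output period (1 / rate) per image.
    """
    return stages / images_per_second


def pipeline_total_time(stages: int, images: int, images_per_second: float) -> float:
    """Time to push a whole sequence through: fill the pipe, then one image per period."""
    return (stages + images - 1) / images_per_second


print(pipeline_latency(100, 10.0))           # 10.0
print(pipeline_total_time(100, 1000, 10.0))  # 109.9
```

For a long sequence the fill time is a small fraction of the total, which is why a
pipeline suits batch work far better than it suits real-time control.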


Other  disadvantages  are  the  difficulty  in  feeding  forward  partially  processed  images 
(to  be  used  in  combination  later  in  the  chain)  and  the  virtual  impossibility  of  handling 
feedback (when the parameters of the early stages of processing have to be adapted to
the  results  of  later  stages). 


5.4 MIMD Arrays


A  third  approach  to  parallel  image  processing  makes  use  of  a  relatively  small  set  of 
loosely coupled, powerful PEs, each capable of independent operation. A typical
number would be 64 or less and the PE might be a microprocessor or even a PC. In
principle, the image processing task is shared between all the PEs which then
communicate over a high speed bus or some more complicated network. Each PE
will have its own program store and substantial local memory whereas the system as a
whole will usually be arranged so that one PE acts as a master controller and a major
block  of  memory  can  be  accessed  by  all  the  PEs.  The  classification  Multiple 
Instruction  stream,  Multiple  Data  stream  is  clearly  applicable  since  each  PE  executes 
its  own  program  on  its  own  part  of  the  data. 


[Figure 3: blocks labelled Data Bus, M 1-3, Common Memory and Master Controller]


VECPAR  '2000  -  4th  International  Meeting  on  Vector  and  Parallel  Processing 


Fig.  3.  Simple  MIMD  system  with  three  PEs  (each  with  local  memory)  and  a  master 
controller,  together  with  a  common  memory  block 

MIMD  systems  have  not  made  much  impact  on  image  processing.  Just  as  employing 
more  staff  will  not  necessarily  get  a  job  done  more  quickly,  so  it  has  been  found  that 
adding more PEs to an MIMD system does not always result in faster processing.
Indeed, the additional overhead resulting both from subdividing the task and
from communicating between the PEs can even result in a reduction of performance
as more PEs are incorporated. The most serious objection to MIMD systems is that
they are very difficult to program. Compilers which will efficiently segment the
processing task into blocks, [text obscured] idle for much of the time, rarely
exist and, in any case, [text obscured] over a range of different
applications. It is therefore left to the programmer to decide how to employ the
parallelism  and  this  will  imply  that  the  programmer  must  know  much  more  about  the 
structure  of  the  hardware  system  than  is  normal  for  software  designers. 
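
The way in which overhead can cancel the benefit of additional PEs is easily shown
with a simple cost model. The model below is purely an assumption made for this
illustration (an ideal share of the work plus a fixed overhead for every additional
PE); it is not derived from measurements on any MIMD system.

```python
def mimd_time(t1: float, n: int, overhead_per_pe: float) -> float:
    """Hypothetical cost model: ideal share of the work plus a per-PE overhead."""
    return t1 / n + overhead_per_pe * (n - 1)


def best_pe_count(t1: float, overhead_per_pe: float, max_pes: int) -> int:
    """PE count in 1..max_pes giving the shortest modelled run time."""
    return min(range(1, max_pes + 1), key=lambda n: mimd_time(t1, n, overhead_per_pe))


t1, overhead = 64.0, 1.0
for n in (1, 4, 8, 16, 64):
    print(n, mimd_time(t1, n, overhead))  # 64.0, 19.0, 15.0, 19.0, 64.0
print(best_pe_count(t1, overhead, 64))    # 8
```

With these made-up figures, 64 PEs take as long as one, and nothing is gained beyond
eight.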


5.5  Special  purpose  devices 


Faced  with  apparently  insuperable  difficulties  in  producing  fast,  efficient,  general 
purpose image processing computers, some designers have tackled the more
achievable challenge of designing special purpose circuits which perform a very limited
range  of  operations.  For  example,  in  some  applications,  an  image  transformed  so  that 
only  the  edges  of  objects  are  displayed  (as  white  lines  on  a  black  background)  can  be 
useful.  Another  application  needs  to  isolate  only  those  parts  of  an  image  which  are 
changing,  perhaps  because  an  object  in  the  scene  is  moving. 

Some of these devices combine a retina-like array of optical detectors with a matching
array of hard-wired logic elements; others use a sequence of hard-wired processing
units  in  a  pipeline  configuration.  In  today's  jargon,  systems  such  as  these  could  be 
called  smart  cameras  but  their  smartness  is  strictly  limited  and,  somehow, 
disappointing. 


5.6  Summary 


There  have  been  many  approaches  to  parallelism  in  computers  designed  principally 
for  image  processing.  The  precise  form  of  parallel  architecture  chosen  is  likely  to 
depend  on  the  range  of  tasks  to  be  tackled.  Thus,  systems  to  be  used  for  real-time 
control  based  on  television  cameras  will  almost  certainly  not  be  applicable  to  batch 
processing  of  large  numbers  of  images  collected  by  astronomers.  Again,  devices  for 
motion detection would have no place in a pathology laboratory dedicated to cervical
smear analysis.


Parallel  processing  systems  cannot  be  neatly  categorised  and  it  is  doubtful  whether 
there  would  be  any  value  in  doing  so  at  this  stage  in  their  development.  For  those  of 
us who have spent much of our working lives studying and designing such systems, it
is  discouraging  to  have  to  admit  that  the  need  for  parallel  systems  in  image  processing 
has  fallen  to  a  low  priority.  The  current  obstacle  to  progress  is  the  lack  of  effective 
algorithms;  workstations  and  the  latest  generations  of  PC  are  usually  quite  fast 
enough  for  anything  that  needs  to  be  done. 


6.  Historical  Background 


6.1  Pioneer  research 


Blindness  is  a  terrible  affliction.  Most  of  the  human  environment  is  designed  or  has 
been  adapted  on  the  assumption  that  we  can  see  and  the  vast  majority  of  tasks 
performed  by  humans  rely  on  human  vision  to  provide  the  necessary  feedback  to 
control  performance.  Without  the  gift  of  vision,  humans  are  greatly  restricted  in  what 
they  can  do. 


In  the  same  way,  the  development  of  sophisticated  automation,  especially  in  the 
manufacturing  industry,  has  been  retarded  by  the  lack  of  competent  computer  vision 
systems. This is particularly serious with respect to inspection of manufactured parts,
and similar problems occur in medicine in the areas of mass screening; the subject of
optical  character  recognition  has  already  been  mentioned  in  this  review.  Pure  science 
would  also  benefit  if  it  were  possible  to  automate  the  analysis  of  photographic  images 
produced  in  many  research  areas,  high  energy  particle  physics  and  astronomy  being 
the  earliest  of  these  to  generate  this  requirement. 

The inadequate performance of even the fastest available computers in the early
60s (when the demand for computer vision was beginning to become apparent)
stimulated computer scientists to turn their attention to the research that was then in
progress investigating the mechanisms underlying biological vision. Two seminal
papers in this area were the study of frog vision by Lettvin et al [14] and a slightly
later paper by Hubel and Wiesel on cat vision [13]. Herscher and Kelley embodied
the ideas behind the first paper in a hardware demonstration [12].

In  the  studies  of  both  the  frog  and  the  cat,  the  anatomy  of  the  visual  system  was  seen 
to  embody  an  array  of  photodetectors  (rods  and/or  cones)  forming  the  retina  with  the 
electrical  outputs  of  the  photodetectors  being  cross-connected,  effecting  both 
summation  and  lateral  inhibition  (i.e.,  a  strong  output  from  one  photodetector  reduces 
the  strength  of  the  output  from  its  neighbours).  The  modified  outputs  are  fed  via  a 
bundle of nerve fibres (the optic nerve) through to the visual cortex of the brain where
layer  upon  layer  of  densely  interconnected  neurons  carry  out  parallel  logic  operations 
on  the  retinal  outputs.  In  the  case  of  the  frog,  only  a  very  small  number  of  image 
properties  can  be  extracted  from  the  optical  data,  such  as  detection  of  an  object 
moving  into  the  field  of  view.  However,  the  cat's  visual  system  is  very  similar  to  the 
human's  and  is  therefore  capable  of  greatly  sophisticated  scene  analysis.  In  all  these 
studies,  the  anatomical  investigation  was  supplemented  by  physiological 
measurements  and  much  was  learnt  about  how  the  systems  effected  their  processing. 
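
Lateral inhibition is simple to imitate numerically. The one-dimensional sketch below
is only an illustration of the principle: the function name, the inhibition strength
and the clipping of negative outputs to zero are all assumptions, not properties taken
from the biological studies cited.

```python
def lateral_inhibition(signal, strength):
    """Reduce each detector's output by a fraction of its two neighbours' outputs.

    Detectors beyond the ends of the row are treated as silent (output 0), and
    negative results are clipped to zero.
    """
    outputs = []
    for i, value in enumerate(signal):
        left = signal[i - 1] if i > 0 else 0.0
        right = signal[i + 1] if i < len(signal) - 1 else 0.0
        outputs.append(max(0.0, value - strength * (left + right)))
    return outputs


print(lateral_inhibition([1.0, 1.0, 5.0, 1.0, 1.0], strength=0.25))
# [0.75, 0.0, 4.5, 0.0, 0.75]
```

The strong central response suppresses its neighbours completely, so a local peak in
brightness stands out more sharply in the output than in the input.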


At  approximately  the  same  time  as  this  work  was  started,  Unger  published  the  first  of 
his  two  papers  [24], [25]  proposing  a  processor  array,  although  he  did  not  build  an 
array  himself;  in  fact,  these  papers  seem  to  be  his  last  contact  with  the  subject  of 
computer  architecture.  His  papers  described  a  theoretical,  square  array  of  simple 
logic  elements,  each  of  which  could  receive  data  from  or  send  data  to  any  of  its  four 
neighbours.  He  demonstrated  that  his  array  could  execute  simple  but  useful  functions 
on  arrays  of  data  of  the  same  dimensions  as  the  logic  array  but  he  did  not  suggest  how 
these  logic  elements  could  be  implemented  in  hardware.  Fortunately,  the  Unger 
papers  served  to  inspire  others  who  then  did  construct  hardware  based  on  the  ideas  he 
had  expressed.  Another  pioneer  was  Golay  [10,  11]  whose  processor  proposals, 
although  conceived  as  a  serial  device,  were  turned  into  hardware  by  Preston  [16]  who 
was  well  aware  that  a  more  parallel  version  could  have  been  constructed. 

Computers  whose  designs  were  based  loosely  on  Unger's  ideas  were,  in  order  of 
construction, Solomon [19], ILLIAC III [17], ILLIAC IV [1], [21] and DAP [7]. It is
not clear whether Solomon was actually constructed and made to operate but ILLIAC
III caught fire before it could be completed and only 'worked' in simulation. ILLIAC
IV was only partially completed but sufficient was built to enable it to carry out many
large-scale  computations.  DAP  started  being  developed  in  1973,  was  prototyped  in 
1976  and  put  into  commercial  production  in  1980.  The  last  machines  in  this  sequence 
were  MPP  [2]  which  first  appeared  in  1983  and  the  Connection  Machine  [13]  which 
later  evolved  into  the  commercial  CM  series  of  massively  parallel  processor  systems. 

Parallel  processing  research  in  the  Image  Processing  Group  in  the  Department  of 
Physics  at  University  College  London  (UCL)  was  initially  influenced  by  the 
biological  papers  listed  above.  The  research  into  parallel  processing  followed  some 
seven  years  of  development  of  semi-automatic  microscopes  and  other  image  analysis 
equipment  (1958-1965),  constructed  for  the  three  High  Energy  Particle  Physics 
groups in the same department. Unger's paper was not seen by the UCL group until
many  years  later  and  it  was  surprising  to  see  how  the  two  disconnected  lines  of 
research  had  by  then  converged.  At  this  time,  another  field  of  research  was  also 
coming  into  being:  Neural  Networks.  The  pioneer  work  here  was  carried  out  by 
Rosenblatt  [20]  who  devised  the  Perceptron.  This  circuit  loosely  simulated  a  neuron 
and  introduced  the  idea  of  constructing  circuits  which  could  be  trained  to  make 
decisions  by  adjusting  the  values  of  certain  circuit  elements  (usually  variable 
resistors)  in  response  to  a  set  of  training  patterns.  The  strengths  of  selected  pattern 
features  were  translated  into  voltages  which  were  then  summed  through  the  variable 
resistors  (one  for  each  feature).  The  resulting  summed  voltage  was  then  compared 
with  a  threshold  voltage  and  the  pattern  classified  as  class  A  (sum  at  or  above  the 
threshold)  or  class  B  (sum  below  the  threshold).  If  necessary,  the  automatic  trainer 
then adjusted the variable resistors appropriately to correct the decision and a new
pattern was presented. It could be shown that the process would converge so that,
ultimately, all the circuit's decisions were correct for the training set and would
generally  be  correct  for  similar  but  previously  unseen  patterns.  This  work  was  also 
influential  on  the  UCL  programme. 
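
The training procedure described above can be expressed as a short program. This is a
software sketch under stated assumptions, not a model of Rosenblatt's circuit: the
weights play the part of the variable resistors, the threshold is held fixed, the
correction rule (add or subtract the feature values) is one common choice, and the
training set is a hypothetical, linearly separable one for which the procedure
settles quickly.

```python
def classify(weights, threshold, features):
    """Class 'A' if the weighted sum is at or above the threshold, else 'B'."""
    total = sum(w * x for w, x in zip(weights, features))
    return "A" if total >= threshold else "B"


def train(patterns, weights, threshold, step=1.0, max_passes=100):
    """Adjust the weights until every training pattern is classified correctly."""
    weights = list(weights)
    for _ in range(max_passes):
        errors = 0
        for features, label in patterns:
            if classify(weights, threshold, features) != label:
                sign = 1.0 if label == "A" else -1.0
                weights = [w + sign * step * x for w, x in zip(weights, features)]
                errors += 1
        if errors == 0:  # a full pass with no corrections: training is complete
            break
    return weights


# Hypothetical training set with two features per pattern.
patterns = [((1.0, 0.0), "A"), ((1.0, 1.0), "A"), ((0.0, 1.0), "B"), ((0.0, 0.0), "B")]
weights = train(patterns, weights=[0.0, 0.0], threshold=0.5)
print([classify(weights, 0.5, f) for f, _ in patterns])  # ['A', 'A', 'B', 'B']
```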


6.2  Research  at  UCL 


UCPR1 


A  research  grant  application  written  in  1965  to  request  support  for  the  UCL  research 
programme  is  of  interest.  It  could  be  submitted  even  today  with  very  little 
modification  since  it  addresses  problems  which  are  relevant  to  the  design  of  parallel 
processing  systems  and  are  still  unsolved: 


One  of  the  main  limitations  on  the  design  of  neuron-like  networks 
has  been  the  prohibitive  cost  of  constructing  circuits  which  involve 
very  large  numbers  of  circuit  elements  together  with  a  high  degree 
of  interconnection  between  the  elements.  If  these  limitations  were 
to  be  removed  by  exploiting  some  of  the  relatively  new  techniques 
for the production of microminiature circuits, then it might prove
possible to develop networks which would embody some of the
considerable analytical facilities of neural nets. In addition, the
increased component density would permit a measure of
redundancy, and local failure would not impair efficiency of the net.
This, in its turn, would allow the use of circuit construction
techniques  which  do  not  produce  component  values  within  close 
tolerances. 


. —The  last  part  of  the  proposed  programme  which  has  been 

envisaged  would  comprise: 

a)  The  construction  of  transistor  models  of  neural  elements  with  a 
view  to  producing  a  critical  survey  of  their  properties  and  to 
designing  improved  elements; 

b)  Assembly  of  such  elements  into  various  arrays,  exploring  the 
numerous  modes  of  interconnection; 

c)  Simulation  of  such  networks  by  means  of  computer  programs,  and 
developing  appropriate  mathematical  methods  to  handle  the  logical 
circuit  analysis; 

d)  Translation  of  the  circuitry  into  microminiature  components,  and 
utilisation of circuit replication techniques.

Application to the UK Department of Scientific and Industrial
Research for support for a research programme entitled 'Pattern
Recognition Matrices', dated March 1965

This  programme  resulted  in  the  construction  and  demonstration  in  September  1967  of 
UCPR1  [4].  Integrated  circuits  were  not  generally  available  at  this  time  so  the  active 
components  in  UCPR1  were  diodes  and  transistors.  Regions  of  interest  in 
photographs  of  the  tracks  of  high  energy  charged  particles  (in  nuclear  emulsions  and 
from  cloud  chambers  and  bubble  chambers)  are  characterised  by  either  a  sharp  change 
in  direction  or  by  a  branching  of  the  track  into  two  or  more  components.  Automated 
scanning  equipment  had  been  built  which  needed  manual  centring  on  these  regions  so 
UCPR1  was  designed  to  show  the  possibility  of  making  a  retina-like  device  which 
would  detect  the  regions  automatically. 


Fig.  4.  A  working  demonstration  of  the  parallel  processor  UCPR1,  as  demonstrated  at  the 
Physical  Society  Exhibition  in  London  in  1967.  The  lamp  at  the  top  illuminates  a  track 
chamber  photograph  placed  over  an  array  of  photodiodes.  The  electric  lamp  array  to  the  right 
shows  the  location  of  vertices  in  the  photograph. 
The input to the system was a square array of 256 photodiodes onto which the track
photograph was projected. Hard-wired circuits were layered under the photodiodes
and performed the following functions:

1.  Amplification; 

2.  Summation  over  a  3  x  3  window  surrounding  each  photodiode; 

3. Non-linear amplification of the summed output, saturated by at least two out of
the nine possible inputs;

4. Summation over the outer edge of the 5 x 5 window centred on each amplifier
output;

5. Comparison of the final output with a variable threshold, scanned from a high
value downwards and designed to locate the maximum summed output;

6.  The  threshold  scanner  stopped  scanning  as  soon  as  a  maximum  was  detected  and 
the  final  layer  outputs  were  fed  to  a  256  x  256  array  of  light  bulbs.  The  bulb  or 
bulbs  which  lit  indicated  the  position  of  the  detected  region  of  interest  (referred 
to  as  a  vertex).  The  variable  threshold  scanned  at  50  Hz  so  vertices  could  be 
detected in real-time, i.e., once every 20 ms.


The  Diode  Array 

UCPR1  achieved  what  it  set  out  to  do:  it  successfully  detected  and  located  vertices  in 
charged  particle  track  photographs.  A  small  piece  of  extra  hardware  showed  that  it 
could  also  be  used  to  detect  ends  of  lines  and  a  further  extension  enabled  UCPR1  to 
recognise  carefully  drawn  alphanumerics  (but  not  the  complete  alphabet)  by  analysing 
the  locations  of  ends  and  vertices.  The  obvious  weakness  of  the  UCPR1  concept  was 
that  each  layer  of  processors  could  only  execute  a  single  logic  function.  In  effect, 
UCPR1  was  unprogrammable. 
Fig. 5. A single PE of the Diode Array, showing a neon indicator (ON or OFF for one or zero
outputs) and a double-pole, double-throw switch to allow zero or one to be entered.

The  Diode  Array  project  [5]  was  the  first  attempt  to  determine  what  was  the  simplest 
specification  for  a  processing  element  (PE)  that  would  enable  it  to  be  programmed  to 
perform  all  possible  functions  on  arrays  of  data.  Consideration  of  the  experience 
gained  in  studying  UCPR1  and  also  taking  into  account  what  was  then  known  about 
the  construction  of  the  mammalian  retina,  led  to  the  proposal  that  each  PE  should  be 
able  to  input,  store  and  output  single-bit  data,  should  be  capable  of  inverting  data  and, 
finally,  should  be  connected  to  neighbours  in  such  a  way  that  data  from  neighbours 
could  be  input  as  a  logical  OR  of  all  four  inputs. 

The  basic  PE  is  shown  in  figure  5.  It  includes  a  neon  bulb  which  glowed  to  show  a  1 
output  (dark  for  a  0  output)  and  the  points  labelled  A  to  J  and  +  were  initially  left 
unconnected.  The  double  pole  switch  was  used  to  input  a  1  or  a  zero  (corresponding 
to  its  ON  and  OFF  positions).  A  small  5x5  array  was  constructed  and  additional 
electromechanical  relay  circuits  added  to  enable  the  user  to  systematically  connect 
together  various  combinations  of  the  labelled  circuit  points,  the  same  combination  in 
each  PE.  Treating  the  switch  state  as  representing  black  and  white  image  data,  it 
could  be  demonstrated  that  functions  such  as  image  inversion,  object  edge  extraction 
and  object  expansion  and  shrinking  could  be  effected. 
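The operations mentioned above can be sketched in software under the PE specification given earlier (single-bit storage, inversion, and a neighbour input formed as the logical OR of the four neighbours). The original circuit connections are not given in the text, so the function names and the particular compositions below are assumptions; the AND used for edge extraction could itself be built from OR and inversion.

```python
def neighbour_or(img):
    """OR of the four edge-connected neighbours of each cell.

    Cells outside the array are read as 0.
    """
    h, w = len(img), len(img[0])

    def at(y, x):
        return img[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [[at(y - 1, x) | at(y + 1, x) | at(y, x - 1) | at(y, x + 1)
             for x in range(w)] for y in range(h)]


def invert(img):
    """Image inversion: swap 0 and 1 in every cell."""
    return [[1 - v for v in row] for row in img]


def expand(img):
    """Object expansion: a cell becomes 1 if it or any neighbour is 1."""
    n = neighbour_or(img)
    return [[a | b for a, b in zip(r1, r2)] for r1, r2 in zip(img, n)]


def shrink(img):
    """Object shrinking, obtained here as invert -> expand -> invert."""
    return invert(expand(invert(img)))


def edges(img):
    """Edge extraction: object cells that disappear when the object shrinks."""
    s = shrink(img)
    return [[a & (1 - b) for a, b in zip(r1, r2)] for r1, r2 in zip(img, s)]


# A 3 x 3 object centred in a 5 x 5 array, matching the size of the hardware array.
image = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
         for y in range(5)]
```

Because every neighbour contributes through the same OR, none of these operations depends on the direction from which a signal arrives.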


A  computer  simulation  of  the  array  was  written  together  with  a  Monte  Carlo  program. 
This  applied  a  wide  range  of  random  intra-processor  connections  (between  the 
labelled  points)  with  a  view  to  discovering  all  differing  image  processing  operations 
which  could  be  implemented  by  the  array.  The  otherwise  exhaustive  search  was 
narrowed  by  eliminating  obviously  useless  connections,  such  as  connecting  the 
positive  voltage  supply  (+)  to  Earth.  Rerunning  the  program  many  times  established 
the  existence  of  more  than  70  processing  functions.  For  reasons  that  are  not  clear, 
those  functions  which  had  been  built  into  the  hardware  array  were  discovered  by  the 
Monte  Carlo  program  earliest  in  its  operation. 

Because  the  connections  between  PEs  were  combined  by  OR-gates  to  provide  a  single 
input  into  neighbouring  PEs,  the  array  had  no  'sense  of  direction'.  For  example,  it 
would  never  be  able  to  detect  that  one  object  lay  above  another  in  an  image.  Also,  the 
obsolescent  hardware  components  used  to  construct  the  array  imposed  undesirable 
constraints  on  the  implementation  of  the  logic  functions.  The  next  stage  in  the 
research  programme  utilised  first  small  scale,  next  medium  scale  and  finally  large 
scale  integration. 


The  CLIP  Project 

Continuing  the  search  for  the  optimum  PE  design,  a  series  of  array  processors  was 
constructed.  These  so-called  Cellular  Logic  Image  Processors  (CLIP1  to  CLIP4) 
were  gradually  increased  in  complexity  thus  allowing  each  to  be  thoroughly 
understood  before  additional  sophistication  was  permitted.  CLIP1  and  CLIP2  will  not 
be  described  here  as  all  their  important  features  were  included  in  CLIP3.  CLIP3  will 
also not be discussed in detail since its main purpose was to provide a design study
for a fully integrated version which could be manufactured and marketed. In fact, for
reasons of cost, CLIP4 was slightly less complex than CLIP3. The higher level of
integration in CLIP4 (8 PEs per integrated circuit) made it economic to build an array
of 96 x 96 PEs whereas CLIP3 had only 16 x 12 PEs and was not of practical value
for applied image processing.


The  logic  functions  of  CLIP4  are  shown  in  outline  in  Figure  6.  At  the  heart  of  the  PE 
are  two  identical  minterm  generators.  Each  has  two  binary  data  inputs  (A,  the  value 
of  the  local  pixel,  and  a  composite  value  derived  from  another  pixel  value  stored  in  B 
and from data from neighbours), one binary output and four binary control inputs. By
applying any of the sixteen possible 4-bit binary control words to a generator, any of
the sixteen possible Boolean combinations of the two inputs can be produced at the
output. The output from the lower generator is distributed to neighbouring PEs and
the upper output is stored as a result. Each PE, on receiving inputs from neighbours,
selects a subset by means of a programmable gate and ORs the subset with a single bit
stored in local memory (B in the figure). Further gates allow the PE to act as a full
adder. Additional connections are used to input and output data to and from the array.
The detailed operation of the PE is too complex to describe in the space available for
this review but a full description of the CLIP3/CLIP4 systems can be found in [3], [6].
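The behaviour of a minterm generator can be illustrated with a short sketch. Each control bit is taken to enable one of the four minterms of the two inputs, so the sixteen control words yield the sixteen Boolean functions of two variables. The assignment of control bits to minterms used here is an assumption; the actual CLIP4 encoding is documented in the cited references.

```python
def minterm_generator(control, a, b):
    """Output of a two-input Boolean function selected by a 4-bit control word.

    Bit k of `control` enables the minterm for which 2*a + b == k
    (an assumed bit ordering, chosen for illustration only).
    """
    return (control >> (2 * a + b)) & 1


def truth_table(control):
    """Outputs for (a, b) = (0,0), (0,1), (1,0), (1,1)."""
    return tuple(minterm_generator(control, a, b)
                 for a in (0, 1) for b in (0, 1))


AND, OR, XOR = 0b1000, 0b1110, 0b0110   # example control words under this ordering
all_tables = {truth_table(c) for c in range(16)}
```

With this ordering, the control word `0b1000` gives AND, `0b1110` gives OR and `0b0110` gives exclusive OR.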


Fig. 6. Schematic logic diagram of the CLIP4 processing element


Three  classes  of  operation  can  be  performed  by  these  CLIP  processors.  They  are 
those  in  which: 

•  Each  output  pixel  is  a  function  only  of  the  corresponding  input  pixel; 

•  Each  output  pixel  is  a  function  of  the  corresponding  input  pixel  and  of  the  eight 
pixels  surrounding  it; 
•  Each output pixel is a function of the corresponding input pixel and of any other
pixel connected to it by propagation through chains of neighbouring pixels.

One  further  feature  of  the  array  is  an  OR-gate  (not  shown  in  the  figure)  with  inputs 
from  every  PE,  used  to  determine  whether  a  binary  image  stored  in  the  PEs  has  at 
least  one  pixel  which  is  non-zero.  In  general,  the  PE  processes  single  bit  binary  data 
in  each  operation;  multiple  bit  data  is  processed  one  bit  at  a  time,  i.e.,  bit-serially. 
Although  beyond  the  scope  of  this  review,  it  can  easily  be  shown  that  an  array  of  PEs 
with  the  features  listed  here  can  be  programmed  to  perform  all  image  operations  and, 
indeed,  all  mathematical  calculations.  In  short,  CLIP3  and  CLIP4  are  universal 
computing  systems. 
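Bit-serial processing can be illustrated by adding two multi-bit images one bit-plane at a time, with each PE acting as a full adder and keeping its own carry bit. This is a minimal software sketch, not the CLIP4 instruction sequence; the helper names and the flat-list representation of the array are assumptions.

```python
def full_add(a, b, carry):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ carry
    c = (a & b) | (carry & (a ^ b))
    return s, c


def bit_serial_add(planes_a, planes_b):
    """Add two images stored as bit-planes, least significant plane first.

    Each plane is a flat list with one bit per PE. Every PE processes a
    single bit per step and stores its carry for the next plane.
    """
    n = len(planes_a[0])
    carry = [0] * n
    result = []
    for plane_a, plane_b in zip(planes_a, planes_b):
        plane = []
        for i in range(n):  # performed by all PEs simultaneously in hardware
            s, carry[i] = full_add(plane_a[i], plane_b[i], carry[i])
            plane.append(s)
        result.append(plane)
    result.append(carry[:])  # the final carry forms the most significant plane
    return result


def to_planes(values, bits):
    """Split integer pixel values into bit-planes, least significant first."""
    return [[(v >> k) & 1 for v in values] for k in range(bits)]


def from_planes(planes):
    """Reassemble integer pixel values from bit-planes."""
    return [sum(p[i] << k for k, p in enumerate(planes))
            for i in range(len(planes[0]))]


total = from_planes(bit_serial_add(to_planes([3, 5, 7, 0], 3),
                                   to_planes([1, 6, 7, 0], 3)))
```

The number of steps grows with the word length rather than with the number of pixels, which is the trade-off made by a bit-serial array.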

The  development  of  CLIP4  extended  from  1974  to  1980.  At  that  time,  the  CLIP4 
integrated  circuit  was  the  largest  ever  to  be  manufactured  in  the  UK  under  contract  to 
the  universities  and  the  technical  difficulties  experienced  were  immense.  After  this 
worrying  development  period,  CLIP4  was  applied  to  many  image  processing  projects 
and  was  in  constant  use  for  the  next  10  years.  It  was  certainly,  at  the  start,  the  largest 
working  parallel  processor  array  in  the  world  and  achieved  the  fastest  real-time  image 
processing  at  that  time. 


7.  Limitations  of  Image  Processors 


Every  dedicated  image  processing  system  has  its  limitations.  Most  embody  as  much 
parallel  structure  as  is  practicable  but  every  design  falls  short  in  some  way  or  another. 
Special  purpose  circuits  providing  a  very  restricted  range  of  functions  can  only  be  of 
similarly  restricted  applicability,  although  some  attempts  have  been  made  to  build 
computers  combining  several  special  purpose  circuits  into  one  composite  system. 
Their  performance  is  not  impressive  since  most  of  the  units  are  idle  for  most  of  the 
time  and  the  effective  parallelism  is  low. 

The  latency  effect  in  pipeline  processors  together  with  the  difficulty  experienced  in 
programming  them  in  many  applications  has  resulted  in  such  systems  falling  into 
disuse.  Processor  arrays  are  also  not  easy  to  program  although  this  is  a  skill  which 
can  be  learned;  there  are  no  insurmountable  difficulties  in  writing  parallel  forms  of 
most  image  processing  operations. 

A  more  serious  problem  is  that  processor  arrays  suffer  from  two  related  inefficiencies. 
The  first  is  that,  in  general,  moving  images  in  and  out  of  the  array  is  a  serial  process 
and  therefore  slow.  Secondly,  moving  data  between  extremes  of  the  array  (as,  for 
example,  is  necessary  when  performing  Fourier  Transforms)  involves  stepping 
through  chains  of  neighbouring  PEs  and  is  also  very  slow.  Both  these  inefficiencies 
can  be  lessened  by  adding  more  connection  paths  and  this  has  been  done  in  later 
systems,  such  as  the  Connection  Machine  [13].  A  further  problem  is  cost.  Processor 
arrays  are  much  less  efficient  when  the  size  of  the  image  array  is  larger  than  that  of 
the  processor  array.  Unless  the  level  of  integration  can  be  made  very  high,  the  cost  of 
constructing and assembling enough PEs to match images of television quality is too
great for the majority of potential users.


Research into parallel processing architectures for image processing has slowed down
noticeably in recent years. On the one hand, the high cost of building these machines
has made it difficult to obtain funding from the organisations which used to support
this research. Equally, the long lead time for the production of new systems, taken
together with the limited and uncertain market for the systems once they are
produced, has discouraged manufacturing industry from continuing to be involved.


However, possibly the major factor which slowed the pace of this field of research
was the lessening of demand from the image processing community. The widespread
availability of high-powered workstations and the ever increasing
performance/cost ratio of PCs have meant that the priority for development of
systems with higher speed has been displaced by a need for more effective algorithms
in the majority of active areas in applied image processing. It is also the experience
of many in the field that the image processing software packages which can be
purchased tend to be disappointingly inflexible, especially when there is a need to
incorporate new functions not contained in the original package. Consequently,
development effort has been switched from hardware to software.


A  cynical  comment  on  the  state-of-the-art  in  image  processing  (or,  perhaps  more 
accurately,  image  analysis)  would  be  that  the  computers  now  commercially  available 
enable  us  to  run  bad  programs  adequately  quickly  and  the  use  of  even  the  best  parallel 
processing  methods  would  do  nothing  more  than  allow  us  to  get  poor  results  even 
faster. The same cannot be said about image generation, a wide-ranging subject
embracing  important  and  socially  useful  applications  in  the  medical  field  as  well  as 
commercially  profitable  activities  in  computer  games.  In  this  area,  there  is  always  a 
demand  for  higher  performance. 


8.  Predictions  for  the  Future 


Although  the  study  of  computer  vision  seems  to  be  very  unstructured  and  not 
progressing  as  well  as  had  been  optimistically  expected  three  decades  ago,  there  is 
still  enough  optimism  amongst  researchers  to  merit  laying  plans  for  the  future  when, 
it is believed, successful algorithms will have been developed and, once again, the
need  will  be  for  faster  processors.  Enough  is  now  understood  about  computer 
architecture  to  make  it  certain  that  adequately  fast  processing  will  only  be  achieved  by 
the  use  of  parallelism.  At  the  same  time,  every  attempt  will  have  to  be  made  to 
employ  the  fastest  possible  components. 

There  are  physical  limitations  to  the  extent  to  which  integrated  circuit  devices  can  be 
made  faster.  Current  research  is  exploring  these  limitations  by  investigating 
nanotechnology  where  circuit  components  are  defined  with  a  precision  approaching 
one nanometre ($10^{-9}$ metre). If devices can be made to work with such dimensions,
then it would be conceivable that a CLIP4 array of size 512 x 512, together with
adequate  amounts  of  memory  local  to  each  PE,  could  be  formed  on  one  integrated 
circuit  slice.  Furthermore,  images  could  be  input  to  the  slice  by  projecting  them  onto 
photosensitive  components  located  with  each  PE.  The  power  of  such  a  system  would 
far  exceed  anything  now  in  use  and  the  cost,  assuming  the  technology  had  been  given 
time  to  'mature',  would  be  a  mere  fraction  of  that  of  today's  supercomputers. 

Undoubtedly,  there  will  be  many  major  technical  problems  to  solve.  At  this  scale, 
long  connections  between  parts  of  the  array  are  difficult  to  fabricate.  In  particular,  the 
distribution  of  control  instructions  synchronously  across  the  array  will  be  hard  to 
achieve.  Potential  failure  of  devices  embedded  in  the  array  will  have  to  be  combated 
by  the  liberal  use  of  redundancy. 

There  are  some  indications  that  it  may  be  hard  to  define  and  control  the  characteristics 
of  the  millions  of  devices  in  these  giant  arrays.  If  this  is  true,  then  a  new  style  of 
programming  might  be  necessary  in  which  variability  is  not  only  accepted  but  also 
exploited.  A  Monte  Carlo  program  running  on  a  conventional  computer  gains  its 
power  to  solve  problems  by  introducing  random  numbers  into  what  would  otherwise 
be  a  completely  predictable  performance;  could  it  be  that  a  similar  broadening  of 
capability  might  be  obtained  by  randomising  the  values  of  some  of  the  device 
parameters  in  the  processor  arrays? 

There  is  a  philosophical  point  to  be  made  here.  We  have  always  looked  to  human 
vision  as  a  sort  of  role  model  for  computer  vision  system  designers  but  this  may  have 
been  unwise.  Human  vision  is  there  to  enable  humans  to  survive  in  their 
environment,  not  to  equip  humans  with  a  precise  optical  measuring  system.  In 
everyday  life,  a  broad,  comprehensive  view  of  the  world  is  all  that  is  needed  and  the 
speed  at  which  this  must  be  obtained  is  only  of  the  order  of  human  reaction  time,  i.e., 
an  analysis  in  a  few  tens  of  milliseconds. 

On  the  other  hand,  computer  vision  has  generally  been  used  to  make  fast  and  accurate 
measurements  in  a  very  constrained  environment.  This  may  imply  that  at  least  two 
very different types of image processing computer will be needed: one in which speed
and/or  accuracy  are  the  dominating  goals  and  the  other  in  which  speed  need  not  be  of 
the  highest  but  robustness  in  an  unconstrained  environment  will  be  of  fundamental 
importance. 

Nevertheless, the ideas behind parallel processing computing are justified both by the
physiological example from which they sprang and by the fact that they were found to be
effective when applied to computer architecture. It is difficult to conclude that
tomorrow's  computers  will  revert  to  a  serial  architecture.  Parallelism  is  definitely 
here  to  stay. 
Abstract. This work presents an efficient implementation of a hierarchical
radiosity algorithm on a distributed-memory multiprocessor. The
parallel algorithm is based on a coarse-grain approach that avoids load
imbalance by means of a dynamic scheduling strategy. Experimental
results on the Fujitsu AP3000 multiprocessor using MPI show that this
kind of architecture is appropriate for implementing hierarchical radiosity
methods as a stage of an image synthesis environment.


1  Introduction 

Digital image synthesis is a field of computer graphics whose aim is the
generation of realistic 2D digital images that emulate 3D objects. In order to achieve the
desired degree of realism, it is important to use global illumination algorithms
that take into account the influence of each object located in the environment.

The radiosity method is a widely used global illumination model. The main
advantage of this method lies in the fact that the obtained illumination results
are independent of the viewpoint. Nevertheless, its drawback is the high
computational cost, both in CPU time and memory requirements. For this reason,
several variants of the method have been proposed: progressive radiosity [2],
hierarchical radiosity [6] and, more recently, wavelet radiosity algorithms [9].
This  work  is  focussed  on  the  parallelisation  and  scheduling  of  the  hierarchical 
method.  Although  this  method  drastically  reduces  the  complexity  of  the  classical 
radiosity  algorithm,  it  still  has  a  significant  computational  cost,  which  justifies 
the  use  of  parallel  computing  techniques. 

In the literature, good results have been reported on shared-memory
multiprocessors [10], where all processors have access to the whole scene, and the only
bottleneck is the necessary control of R/W operations to avoid critical section
problems and deadlocks. However, the results on distributed-memory
multiprocessors are not so encouraging, mainly due to the communications overhead.
Zareski et al. [11] applied fine-grain parallelism using a master-slave paradigm,
where each slave performed the ray-polygon intersection computations on the
corresponding subset of patches of the scene. In this case, the speedup of the
algorithm is restrained by the bottleneck of having a master processor and the
large number of communications required. Other implementations also follow a
master-slave model, but use coarse-grain parallelism. In that case, each slave
performs the whole computation of the radiosity on a group of patches of the
scene, and the master takes charge of the dynamic patch distribution, as well
as the convergence analysis. Among these implementations, one approach is to
store the complete scene in the local memory of each processor [1], [3] in order
to minimize communications, although large scenes cannot be processed due to
memory requirements. Another approach is to distribute the scene among the
processors [4], which allows large scenes to be processed, although communications
increase to a great extent.

In  this  work,  we  propose  a  parallel  implementation  of  a  hierarchical  radiosity 
method  on  a  distributed-memory  multiprocessor.  The  scene  is  replicated  in  all 
the  processors,  and  the  load  is  dynamically  balanced  to  avoid  idle  processors. 
We have used an SPMD (Single Program Multiple Data) paradigm, that is, we
do not waste one processor (master or scheduler) on load distribution tasks. Our
scheduling is, therefore, distributed.

This work is organized as follows: Section 2 describes the sequential
algorithm of the hierarchical radiosity method; the parallel versions, both for a static
load distribution and for a dynamic scheduling, are presented in Section 3.
Experimental results on the Fujitsu AP3000 multiprocessor are shown in Section 4.
Finally, conclusions and future work are discussed in Section 5.


2  The  Hierarchical  Radiosity  Algorithm 

The  radiosity  method  is  based  on  applying  to  image  synthesis  the  concepts  of 
thermodynamics  that  rule  the  balance  of  energy  in  a  closed  environment.  In  fact, 
the  radiosity  method  solves  a  global  illumination  problem  expressed  by  Kajiya’s 
equation  [8],  simplified  by  considering  only  ideal  diffuse  surfaces.  The  resultant 
equation  system  is: 


$$B_i = E_i + \rho_i \sum_{j=1}^{n} B_j F_{ij} , \qquad (1)$$

where $B_i$ is the radiosity of patch $i$, $E_i$ is the emittance and $\rho_i$ the diffuse
reflectance. The summation represents the contributions of the other patches
of the scene, where $F_{ij}$ is the form factor between patches $i$ and $j$. This factor
represents the fraction of energy that leaves one polygon and directly reaches
another one. It is a dimensionless constant that depends only on the geometry
of the scene. The number of form factors between all pairs of $n$ patches is $O(n^2)$,
which makes traditional radiosity methods very expensive.
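As a small worked illustration of the radiosity equation system, the sketch below solves it for a toy scene by repeated gathering: each pass recomputes every patch radiosity from the current values of the others. This is the classical full-matrix approach, not the hierarchical method; the function name, the tolerance and the two-patch data are assumptions chosen for illustration.

```python
def solve_radiosity(E, rho, F, tol=1e-9, max_iter=1000):
    """Iteratively solve B_i = E_i + rho_i * sum_j B_j * F_ij.

    E:   emittance of each patch
    rho: diffuse reflectance of each patch
    F:   form factors, F[i][j] between patches i and j
    """
    n = len(E)
    B = list(E)  # start from the emitted energy only
    for _ in range(max_iter):
        new_B = [E[i] + rho[i] * sum(F[i][j] * B[j] for j in range(n))
                 for i in range(n)]
        if max(abs(x - y) for x, y in zip(new_B, B)) < tol:
            return new_B
        B = new_B
    return B


# Toy scene: patch 0 emits light, patch 1 only reflects it.
E = [1.0, 0.0]
rho = [0.5, 0.5]
F = [[0.0, 0.4],
     [0.4, 0.0]]
B = solve_radiosity(E, rho, F)
```

For this scene the system can also be solved by hand: $B_1 = 0.2\,B_0$ and $B_0 = 1 + 0.2\,B_1$, giving $B_0 = 1/0.96$ and $B_1 = 0.2/0.96$. Every pass touches all $n^2$ form factors, which is the cost the hierarchical method sets out to reduce.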

The  complexity  of  the  radiosity  computation  is  dramatically  reduced  by  using 
the  hierarchical  method.  It  subdivides  the  scene  adaptively,  applying  the  fact 
that  small  details  are  not  significant  at  long  distances;  besides,  the  hierarchical 
method  avoids  computing  some  interactions  if  their  form  factors  are  zero,  because 
it means that the patches cannot see each other. The scene is divided into patches
(much larger than the ones used in the classical radiosity methods) that make
up the coarsest level of the hierarchy. These patches are successively subdivided
into elements through an iterative process, until the desired precision in the
illumination of the scene is achieved (see Fig. 1).

Fig. 1. Interactions in an element hierarchy

The  sequential  algorithm  of  the  hierarchical  method  based  on  [6]  can  be 
described  as  follows: 

1.  A  BSP  (Binary  Space  Partition)  tree  is  built  with  the  input  polygons  or 
patches  (Fig.  2  shows  an  example).  This  tree  will  be  useful  later  to  determine 
the  visibility  between  two  patches  in  an  efficient  way. 

2.  For  each  patch  inserted  in  the  BSP  tree,  a  list  of  initial  interactions  (or 
links) is computed. Each entry of this list has as destination another patch
of  the  scene,  potentially  visible  from  the  current  patch.  At  this  stage  we 
consider  that  two  patches  are  potentially  visible  if  their  positive  sides  are 
face  to  face.  The  form  factor  between  the  two  patches  involved  is  computed 
for  each  interaction.  Once  the  initial  interactions  for  all  the  patches  have 
been  computed,  the  iterative  process  that  gathers  the  radiosity  of  each  patch 
begins  in  the  next  step. 

3. For each patch, the radiosity obtained from all its visible interactions is
calculated. If the radiosity emitted by a certain link exceeds a given threshold,
the interaction must be refined (in this work, we use a BF refinement [5]).
To perform this task, either the source element or the destination element
of the interaction (depending on which of them has the larger area) is
subdivided into a quadtree, where the children inherit the current radiosity of
the parent. The refinement is as follows:

(a)  If  the  element  to  be  subdivided  is  the  source  element  of  the  interaction, 
four  new  interactions  with  each  one  of  the  children  of  the  subdivided 
element  are  established  in  the  destination  element. 

(b)  If  the  element  to  be  subdivided  is  the  destination  element,  each  one  of 
its  children  inherits  the  interaction  with  the  source  element. 

In  both  cases,  the  old  interaction  is  discarded.  For  each  patch,  the  radiosity  of 
its current hierarchy of elements is computed through a post-order traversal
of  the  quadtree. 

4. Once all interactions between the elements of the scene have been processed,
the complete radiosity of the scene is summed up and the convergence is
checked by comparing this value with the result obtained in the previous
iteration. If the convergence criterion is fulfilled, the algorithm finishes;
otherwise, a new iteration begins at step 3.
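The link refinement of step 3 can be sketched as follows. This is a simplified, single-pass illustration under stated assumptions, not the authors' implementation: the `Element` class, the `refine` function and the `form_factor(source, destination)` callable are hypothetical names, ties in area are resolved by subdividing the source, and geometry and visibility are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    area: float
    radiosity: float = 0.0
    children: list = field(default_factory=list)
    # Interactions gathered by this element: (source element, form factor).
    links: list = field(default_factory=list)


def subdivide(elem):
    """Split an element into four children that inherit its radiosity."""
    if not elem.children:
        elem.children = [Element(f"{elem.name}.{k}", elem.area / 4,
                                 elem.radiosity) for k in range(4)]
    return elem.children


def refine(dest, threshold, form_factor):
    """One refinement pass over the interactions stored in `dest`."""
    kept = []
    for src, ff in dest.links:
        if src.radiosity * ff <= threshold:
            kept.append((src, ff))            # BF test passed: keep the link
        elif src.area >= dest.area:
            # Case (a): subdivide the source; the destination receives
            # four new interactions, one per child of the source.
            for child in subdivide(src):
                kept.append((child, form_factor(child, dest)))
        else:
            # Case (b): subdivide the destination; each child inherits
            # the interaction with the source.
            for child in subdivide(dest):
                child.links.append((src, form_factor(src, child)))
    dest.links = kept                         # refined links are discarded


def toy_ff(src, dst):
    return 0.1  # placeholder: a real form factor depends on the geometry


big_src = Element("s", 4.0, radiosity=10.0)
small_dst = Element("d", 1.0, links=[(big_src, 0.5)])
refine(small_dst, threshold=1.0, form_factor=toy_ff)
```

In the toy call, the link carries a radiosity of 5, which exceeds the threshold, and the source is the larger element, so it is split and the destination ends up with four links.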

3  The  Parallel  Algorithm 

Two  parallel  versions  of  the  hierarchical  algorithm  have  been  implemented,  using 
the  message-passing  library  MPI.  Both  versions  are  based  on  a  coarse-grain 
approach,  that  is,  each  processor  performs  the  whole  computation  of  the  radiosity 
for  a  set  of  patches  of  the  scene.  In  the  first  approach,  a  static  assignment  of  the 
patches  to  the  processors,  without  applying  any  kind  of  scheduling,  is  carried 
out.  Using  this  implementation,  good  results  could  be  obtained  for  images  that 
give rise to a regular load distribution among the processors. Nevertheless, in
most cases the execution of the algorithm causes load imbalance due to the
unpredictable behaviour of the refinement, which results in poor speedups; this
effect becomes more significant as the number of processors increases. Thus, we have
developed a second version of the parallel algorithm to balance the computations
through a distributed dynamic scheduling. The following subsections describe
both  approaches. 


3.1  Static  Load  Distribution 

The  first  parallel  algorithm  we  have  developed  distributes  the  workload  among 
the  processors  so  that,  assuming  n  patches  and  p  processors,  each  processor 
computes  the  radiosity  for  a  fixed  set  of  n/p  patches.  Next,  the  changes  with 
respect  to  the  sequential  algorithm  are  detailed: 

1.  It  is  not  worth  parallelizing  the  BSP  tree  construction  and  the  computation 
of  the  initial  interactions  because  their  execution  times  are  negligible  as 
compared  with  the  whole  radiosity  algorithm.  Thus,  each  processor  generates 
its own local copy of the whole BSP tree. As a new feature, once the patches
are inserted in the BSP tree, they are sorted in decreasing area order.
2. Before beginning the iterative process, the sorted patches are cyclically
assigned to the processors, in order to balance the load among the processors.
That is, the patches whose position t in the list satisfies t mod p = i are
assigned to processor i.

3.  During  the  iterative  process  in  which  radiosity  is  computed,  each  processor 
only  takes  charge  of  its  assigned  patches.  Besides,  in  each  iteration,  the 
processor  keeps  a  record  of  those  destination  elements  that  correspond  to 
patches  assigned  to  a  different  processor. 

4. Once the local calculation of radiosity in one iteration is completed, the processors
start a global communication phase in which radiosity values and
tree structures are updated. In this phase, each processor sends data to and receives
data from the other processors, following an all-to-all communication
pattern implemented by MPI total exchange routines (MPI_Alltoall and
MPI_Alltoallv).

5. After the communication stage of each iteration, convergence is checked in
parallel by means of a reduction operation (MPI_Allreduce). Each processor
contributes the partial radiosity of its set of assigned patches to the reduction
and, this way, the whole radiosity of the scene is obtained. Next, each
processor  compares  this  value  with  the  radiosity  in  the  previous  iteration. 
As  in  the  sequential  code,  if  the  difference  is  less  than  a  fixed  threshold,  the 
iterative  algorithm  ends. 
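The cyclic assignment of step 2 and the convergence test of step 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it runs in a single process, the function names (cyclic_assignment, converged) are hypothetical, processors are numbered from 0, and the sum over partial values merely stands in for the MPI reduction.

```python
def cyclic_assignment(areas, p):
    """Return, for each of p processors, the list of patch ids it owns.

    Patches are sorted by decreasing area and dealt out round-robin,
    so the patch at sorted position t goes to processor t % p.
    """
    order = sorted(range(len(areas)), key=lambda idx: -areas[idx])
    owned = [[] for _ in range(p)]
    for t, patch in enumerate(order):
        owned[t % p].append(patch)
    return owned


def converged(partial_sums, previous_total, threshold):
    """Emulate the reduction: add the per-processor radiosity and compare
    the total with the value of the previous iteration."""
    total = sum(partial_sums)
    return abs(total - previous_total) < threshold, total


if __name__ == "__main__":
    print(cyclic_assignment([5.0, 1.0, 9.0, 3.0, 7.0], 2))
```

Dealing the patches out in decreasing order of area gives every processor a mix of large and small patches, which is the intent of the static scheme.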

3.2  Distributed  Dynamic  Scheduling 

The irregular and unpredictable behaviour in the execution of the hierarchical
method makes the parallelisation using a static load distribution inappropriate,
due to the appearance of load imbalance. Although we tried to overcome
this problem by cyclically assigning a list of patches ordered by area, this is not
enough. A further approach could be a patch reassignment at the end of the first
iteration based on the number of interactions of each patch. But this approach
would not be very useful because we have experimentally checked that the first
iteration is the most time-consuming and, therefore, it is necessary to solve the
load imbalance from the beginning of the algorithm (specifically during the first
iteration). With this aim in view, we have developed a second parallel algorithm
that implements a dynamic load distribution. The parallel algorithm (see Fig. 3)
is summarized as follows:

1.  Each  processor  builds  its  own  BSP  tree  with  all  the  patches  of  the  scene, 
and  sorts  them  in  decreasing  order  of  area. 

2.  The  patches  are  cyclically  distributed. 

3.  Each  processor  computes  the  radiosity  of  the  assigned  patches. 

4.  In  the  first  iteration,  if  a  processor  finishes  its  corresponding  computations, 
the  next  step  is  to  check  the  presence  of  non-processed  patches  in  the  ordered 
global list. If so, the processor takes a set of k patches from the list, where k is a
parameter that is predefined experimentally depending on the problem size
and the communication cost. We must take into account that high values
of k reduce the communications overhead of the scheduling at the expense
of a worse load balance, and vice versa. Step 4 is repeated until the list of
non-processed patches is empty.

Fig. 3. Diagram of the algorithm using a distributed dynamic scheduling

5. Once the radiosity of all the patches of the scene is computed, the communication
phase takes place.

6.  The  convergence  of  the  algorithm  is  tested  in  parallel  and,  in  case  of  success, 
the  algorithm  finishes. 

7.  For  the  next  iterations,  each  processor  uses  the  same  patches  as  in  the  first 
iteration  (both  the  patches  assigned  statically  and  the  ones  taken  from  the 
list).  At  the  end  of  each  iteration,  it  returns  to  step  5. 
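The effect of the dynamic step on the first iteration can be illustrated with a small simulation. The Python sketch below is not the MPI implementation: it ignores all communication costs, models each patch by a single hypothetical cost value, and uses invented names (simulate, costs). Setting k=0 disables the dynamic step, so the same function also reproduces the static distribution.

```python
import heapq


def simulate(costs, p, k):
    """Return the time at which the slowest processor finishes.

    costs: per-patch work, listed in decreasing order (a proxy for area).
    Patches are dealt out cyclically; a processor that runs out of its own
    patches takes up to k unclaimed patches from the tail of the list.
    """
    n = len(costs)
    queues = [list(range(proc, n, p)) for proc in range(p)]
    claimed = [False] * n
    # Heap of (time at which the processor becomes idle, processor id).
    idle = [(0.0, proc) for proc in range(p)]
    heapq.heapify(idle)
    finish = 0.0
    while idle:
        now, proc = heapq.heappop(idle)
        finish = max(finish, now)
        # Own patches first, skipping any already taken by another processor.
        own = [t for t in queues[proc] if not claimed[t]]
        if own:
            batch = own[:1]
        else:
            # Search for free patches starting from the end of the list.
            batch = [t for t in range(n - 1, -1, -1) if not claimed[t]][:k]
        if not batch:
            continue  # nothing left to take: this processor is done
        for t in batch:
            claimed[t] = True
        heapq.heappush(idle, (now + sum(costs[t] for t in batch), proc))
    return finish


if __name__ == "__main__":
    work = [8, 1, 1, 1, 1, 1]
    print("static :", simulate(work, 2, 0))
    print("dynamic:", simulate(work, 2, 1))
```

Because communication is free in this model, it cannot show the overhead side of the trade-off; it only shows how taking patches from the tail of the list shortens the first iteration when the static shares are uneven.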

As  can  be  observed,  the  only  significant  difference  between  the  static  and 
dynamic  implementations  lies  in  the  first  iteration  of  the  algorithm,  specifically 
in  the  fourth  step  of  the  dynamic  version.  In  this  step,  seemingly  simple,  the 
scheduling  is  carried  out.  The  main  drawback  of  this  scheduling  lies  in  the  fact 
that  two  or  more  processors  could  compete  for  the  same  patch.  Next,  we  describe 
in detail the protocol we have developed to overcome this problem.


Scheduling Protocol. In order to carry out a dynamic patch allocation, each
processor must keep updated information about those patches that have not
yet been processed. This information is stored in the ordered list of patches and
must be available in all processors. As we are working in a distributed-memory
environment, this availability is achieved by means of message-passing. Thus,
before processing a set of patches, each processor communicates this state to
the rest of the processors. A drawback arises when two or more processors compete
for the same patch. To avoid the assignment of the same patch to different
processors, we have implemented a protocol based on making requests about the
state of the patch that causes the conflict.

Fig. 4. Practical example of the protocol (assuming that k=1): processors 1 and 3
compete for patch 41, but only processor 1 gets the patch

In the fourth step of the parallel algorithm described in this subsection, a
processor, before taking a “free” patch (a patch that has not yet been processed),
sends a request message to the owner of that patch, that is, the processor that
holds the patch by means of the static assignment of step 2 (which is known by all
the processors). If the owner of the patch has not yet started processing it, the ownership
of the patch is transferred to the requesting processor (ACK), provided that the
patch has not already been given to another processor. Otherwise, the patch request is
refused (NACK). Note that, in this case, explicit messages are not used because
the processor that is taking charge of the patch communicates this situation to
the rest of the processors.
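The arbitration performed by the owner of a patch can be sketched as a small state machine. This Python fragment is a minimal illustration, assuming a single process and a hypothetical PatchOwner class; the returned strings only model the outcome of a request and do not imply that a refusal is sent as an explicit message.

```python
class PatchOwner:
    """State kept by the static owner of a set of patches (hypothetical helper)."""

    def __init__(self, patches):
        # Every statically assigned patch starts out free.
        self.state = {patch: "free" for patch in patches}

    def start_processing(self, patch):
        """The owner begins work on its own patch; fails if it was given away."""
        if self.state[patch] != "free":
            return False
        self.state[patch] = "owner"
        return True

    def request(self, patch, requester):
        """Decide a request from another processor: 'ACK' or 'NACK'."""
        if self.state[patch] == "free":
            self.state[patch] = requester
            return "ACK"
        return "NACK"


if __name__ == "__main__":
    owner = PatchOwner([41])
    print(owner.request(41, requester=1))
    print(owner.request(41, requester=3))
```

Since a single processor decides the fate of each patch, two competing requests are serialized at the owner and only the first one can succeed.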

Using this protocol, any kind of incoherence arising from the multiple assignment
of one patch to two or more processors is avoided. For example, in Fig. 4
it can be observed that, once processors 1 and 3 have finished the computations
associated with their assigned patches, they search for free patches in the ordered
list of patches, beginning from the last patch of that list. Both processors
try to get patch 41 (initially assigned to processor 2), but only processor 1 will
finally get it; processor 3 carries on with the search for free patches in the list.

FEUP - Faculdade de Engenharia da Universidade do Porto

Fig. 5. Speedups for the test scene (650 polygons)

During this scheduling stage, nonblocking communications (both send and
receive primitives) are used to overlap communication and computation. Besides,
as the messages to be sent in this stage have the same format and size, as well as
the same destinations, we have used persistent communications by means of MPI
routines: MPI_Send_init, MPI_Recv_init, MPI_Start and MPI_Request_free.
Therefore, the tasks involved in setting up the communication are accomplished
only once.


4  Experimental  Results 

We have tested the parallel algorithms on the Fujitsu AP3000 multicomputer [7]
whose nodes (UltraSparc-II processors at 300 MHz) are connected via a high-speed
communication network (AP-Net) in a 2D torus topology. The test scene
is composed of 650 input triangles and is depicted in Figs. 6 and 7.

The results in terms of speedups are shown in Fig. 5, both for a static patch
assignment and for the dynamic scheduling approach (using k=1 and k=64,
and up to 12 processors). The execution time of the sequential algorithm is 374
seconds, and it is 42.68 seconds using the dynamic scheduling on 8 processors. As can be
observed, the speedup for the static case tends to be constant from 8 processors
upwards due to the effect of load imbalance. The speedups are greatly improved
using the dynamic scheduling that balances the load. This improvement is not so
good for high values of k in relation to the whole scene size (for instance, k=64
in our example) because, although the number of communications decreases,
the load imbalance becomes more noticeable and, therefore, the speedup results
come close to the static approach. For k<64, we have experimentally checked
that the results are very similar to the ones obtained for k=1.
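As a worked check of the two execution times quoted above, the speedup and the parallel efficiency on 8 processors can be computed directly. The helper names below are hypothetical, and the calculation simply takes the two reported times at face value.

```python
def speedup(t_sequential, t_parallel):
    """Ratio of sequential to parallel execution time."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, processors):
    """Speedup divided by the number of processors used."""
    return speedup(t_sequential, t_parallel) / processors


if __name__ == "__main__":
    s = speedup(374.0, 42.68)
    e = efficiency(374.0, 42.68, 8)
    print(f"speedup = {s:.2f}, efficiency = {e:.2f}")
```

With these figures the speedup comes out at roughly 8.76 on 8 processors.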

As  can  be  observed,  although  better  speedups  are  achieved  by  using  the 
dynamic scheduling (for k=1), from a certain number of processors (10 in our
example  scene)  the  speedup  does  not  rise  accordingly.  This  is  because  the  local 
computations  assigned  to  each  processor  are  not  significant  and  it  is  not  worth 
balancing  small  loads  due  to  the  communications  overhead.  Better  speedups  are 
expected  for  larger  scenes  because  they  involve  more  computations  and,  thus,  the 
associated  execution  times  are  very  high  in  relation  to  the  communication  factor. 
According  to  the  speedup  results  we  can  conclude  that  the  algorithm  presents 
a  reasonable  scalability  and  the  larger  the  scenes  are,  the  more  appropriate  the 
dynamic  scheduling  is. 

Regarding  the  correctness  of  the  algorithm  results  (see  the  illuminated  scene 
in  Fig.  6  and  the  scene  divided  into  elements  in  Fig.  7),  we  have  used  the  residual 
error  of  the  radiosity  as  error  metric.  We  have  experimentally  confirmed  that  the 
error  measured  in  the  parallel  implementations  does  not  vary  in  comparison  with 
the  sequential  version. 


5  Conclusions 

In this paper we have described a parallel implementation of the hierarchical radiosity
method on distributed-memory architectures. The parallel method generates
an irregular load distribution, which can be balanced following two strategies:
on the one hand, the patches are initially distributed by area, trying to
assign  the  same  number  of  computations  to  each  processor;  on  the  other  hand,  a 
distributed  scheduling  performs  a  finer  tuning  to  balance  the  load  dynamically, 
by  reassigning  the  smallest  non-processed  patches  to  the  processors  that  finish 
their  work.  Good  speedups,  load  balance  and  an  acceptable  scalability  have  been 
achieved  through  this  approach. 

We  conclude  that  distributed-memory  architectures  can  be  efficiently  used 
to  implement  the  hierarchical  radiosity  method,  although  the  main  drawback  is 
the  memory  overhead  derived  from  the  replication  of  the  BSP  tree,  as  well  as 
part  of  the  hierarchical  structures  of  the  elements,  in  each  processor. 

As  future  work,  we  intend  to  study  alternative  representations  of  the  input 
3D scene to avoid data redundancy. We also expect to improve the iterative process
for the radiosity computation, both to reduce execution times and memory
requirements. Specifically, we will focus on the process to determine visibility,
which  is  currently  implemented  by  means  of  the  ray-casting  algorithm  [5].  Our 
goal is to decrease the amount of time spent testing rays against the environment.

Fig.  6.  a)  Scene  before  applying  the  hierarchical  radiosity  algorithm,  b)  Illuminated 
scene after applying the parallel algorithm

Abstract. This paper describes efficient parallel algorithms for low and intermediate
level vision for a Linear Array of Processors with Multi-mode Access
Memory (LAPMAM). Its special memory and its singular SIMD/restricted
MIMD mode, combined with the parallel search and multiple update operations
of the memory modules, make LAPMAM very efficient in real-time image processing.
We have developed fast parallel algorithms for labeling, area
and perimeter determination, histogramming and median filtering, taking
advantage of the LAPMAM characteristics to make these algorithms efficient.
The architecture and the algorithms were tested in the C language and in a hardware
simulator. The performance obtained is compared with that of different
architectures.


1  Introduction 

The computational demands of computer vision, which requires processing an enormous
amount of information, have motivated a large body of research and led to
numerous architectures and algorithms [1] [2] [3]. Sequential machines require
an excessive amount of time, so this problem generally lies well beyond the capacity
of existing sequential processors. Consequently, the possibilities of parallelism
have been extensively exploited.

In view of the number of processors and their topologies, parallel architectures may be
classified into three-dimensional arrays (Pyramid, Hypercube, etc.), two-dimensional
arrays (CLIP, Mesh with Reconfigurable Mesh, etc.), and one-dimensional array
processors. The Electronic Instrumentation Laboratory of Nancy, France, is developing
a linear array processor architecture for low and intermediate level vision that enhances
its parallelism using a Multi-mode Access Memory (MAM) and a tree interconnection
network. Also, its SIMD/restricted MIMD operating mode allows LAPMAM
to switch between SIMD and MIMD mode automatically with a simple initial
programming. In this architecture, a new concept of SIMD/restricted MIMD
processors is also proposed. The processor has the SIMD structure with its typical
advantages (simple implementation, high performance and no memory access
conflicts), but can also work like a MIMD processor, taking limited conditional
decisions with simple control logic.

This article presents some fast parallel algorithms developed to take advantage of the
LAPMAM characteristics [4]. We obtain a quasi-optimal processor x time
complexity [5] for the intermediate level vision algorithms. For the median
filtering low level algorithm, the typical complexity of O(n) for 1-d architectures is
obtained, but we show how the LAPMAM enhances the parallelism using the
interprocessor communication to reduce the pixel computing operations. In the
following section we present the organization of the LAPMAM. Then, we describe
the algorithms developed. Finally, we discuss the simulation
results and compare different architectures before concluding.


2  LAPMAM  architecture 

The LAPMAM is a linear array of RISC SIMD/restricted MIMD processors with a
Multi-mode Access Memory. The LAPMAM has four memory planes that are dependent
on one another thanks to the bidirectional heteroassociative CAM property. A controller gives
the instruction word to each PE. The PE-PE and PE-MAM communications are
carried out by a tree interconnection network and by a local bus (Memory bus) between
each PE and its corresponding row of memory modules.

The LAPMAM architecture for a 512 x 512 image (n=512) is shown in Figure 1. It
features n processors organized in a linear array. Each of them is connected to a row
of n memory modules. A special interconnection network allows every processor to
reach any of the other processors and their associated memory rows. This network
has a tree structure and ensures global communication in O(log n) units of propagation
time.

LAPMAM has four identical memory planes of log 512 bits denoted MA1[i,j], MA2[i,j],
MB1[i,j] and MB2[i,j] (0 ≤ i,j ≤ 511). Each plane consists of 512 rows, each containing
512 memory modules. The four planes can be turned into two planes MA[i,j] and
MB[i,j] of 2 log 512 bits. In Figure 1 the planes MA1, MA2, MB1, MB2 are represented by
the memory modules (M).


2.1  The  Multi-mode  Access  Memory  (MAM) 

Our MAM module is basically a modified CAM. The CAM is a memory with addressing
based on its content. This is an excellent solution in some applications where
the RAM, with addressing based on location, shows limited performance. The
main advantage of the CAM is its capability to write/read data to/from multiple
locations in only one clock cycle, or O(1) time. Despite its relatively high cost, the CAM
has gained enormous importance in various applications like database
management and image processing [6].

The CAM enhances the parallelism of an architecture because this memory works
inherently in parallel. However, its utilization reduces the processing flexibility since
the CAM cannot be addressed by position and reading the CAM is difficult. To overcome
the limitations of the pure CAM, we have designed a CAM-based memory that also offers
RAM and FIFO modes, called Multi-Mode Access Memory. The MAM
modules constitute either four log n bits wide or two 2 log n bits wide memory
planes. The four planes enable the architecture to work with algorithms that need to
store intermediate results. The image loading procedure is also made simpler thanks
to this possibility: an image frame may be stored in one memory plane while the
previous one is still being processed. The size of the memory words depends on the algorithms
being run (2 log n bits for labeling and log n bits for median filtering, for example).
The CAM and RAM operations can be carried out on a whole plane, on a row
(PE-MAMs) or on several rows of a plane. The FIFO operation is only carried out within
the PE-MAMs pairs.

Writing in normal CAM mode consists of simultaneously updating, with a New_Data, all the memory
plane elements (M) whose content is equal to a Target_Data. The
following algorithm describes the normal CAM mode:

forall M[address] ( 0 ≤ address ≤ n-1 ) do_in_parallel
    if ( M[address] == Target_Data )
        M[address] = New_Data;
    endif
endforall
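The normal CAM write can be emulated sequentially as follows. This Python sketch is only a software model with a hypothetical function name (cam_write); the loop visits the modules one by one, whereas in the memory all modules compare and update in the same cycle.

```python
def cam_write(plane, target_data, new_data):
    """Overwrite every cell equal to target_data with new_data.

    Returns the number of cells that were updated.
    """
    updated = 0
    for address, value in enumerate(plane):
        if value == target_data:
            plane[address] = new_data
            updated += 1
    return updated


if __name__ == "__main__":
    row = [3, 1, 3, 0]
    cam_write(row, target_data=3, new_data=7)
    print(row)
```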


Writing in interactive CAM mode consists of updating elements of a memory plane
with a New_Data if the content of the elements of a different plane is equal to a Target_Data.
The interactive CAM mode is of highest interest as, in this mode, two
planes can be dependent on one another. In Figure 2 a PE addresses plane A with a
Target_Data = 1 (corresponding to "object 1") to update the corresponding memories in
plane B with a "blue" data in O(1) time.
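The interactive mode differs from the normal mode only in that the search and the update touch different planes. A minimal sequential sketch in Python, with hypothetical names (interactive_cam_write, search_plane, write_plane) and flat lists standing in for the memory planes:

```python
def interactive_cam_write(search_plane, write_plane, target_data, new_data):
    """Where search_plane holds target_data, store new_data in write_plane.

    Returns the list of addresses that matched.
    """
    hits = [a for a, value in enumerate(search_plane) if value == target_data]
    for a in hits:
        write_plane[a] = new_data
    return hits


if __name__ == "__main__":
    plane_a = [0, 1, 1, 0, 2]          # object identifiers
    plane_b = ["none"] * len(plane_a)  # attribute to be updated
    interactive_cam_write(plane_a, plane_b, target_data=1, new_data="blue")
    print(plane_b)
```

The usage mirrors the example in the text: every module whose plane A content is 1 receives the "blue" data in plane B.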


Figure  2:  The  capability  of  a  PE  for  writing  to  multiple  rows  in  the  interactive  CAM  mode,  in 


The FIFO mode is used to perform a circular left/right data movement in a MAM-PE
row. Two clock cycles are required to transfer the four planes. This mode allows the
PE to get neighboring information between the memory modules, which is absent in
ordinary CAM cells. Furthermore, since the PE is part of the FIFO, it can read a new
memory data at the same time that it writes the processed data to its memory row. The
FIFO mode thus halves the number of data accesses.
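The circulation of a row through the PE can be pictured with a double-ended queue. The sketch below is an assumption-laden software model, not the hardware behaviour: the function name fifo_pass and the process callback are hypothetical, and each loop iteration stands for one shift in which a processed word enters the last module while the next word enters the PE.

```python
from collections import deque


def fifo_pass(row, process):
    """Stream every word of a memory row through the PE exactly once."""
    if not row:
        return []
    ring = deque(row)
    pe = ring.popleft()            # the PE register takes the first word
    for _ in range(len(row) - 1):
        ring.append(process(pe))   # processed word enters the last module
        pe = ring.popleft()        # while the next word shifts into the PE
    ring.append(process(pe))       # final write completes the circulation
    return list(ring)


if __name__ == "__main__":
    print(fifo_pass([1, 2, 3], lambda value: value * 10))
```

After a full pass the words are back in their original order, each one replaced by its processed value.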

The  RAM  mode  is  obtained  using  the  interactive  CAM  mode.  A  different  address 
must  be  stored  in  one  plane  of  each  memory  module.  A  subsequent  interactive  CAM 
operation with the desired location will only activate one memory module. The MAM
planes not used to store the address may then be either read or written using the interactive
CAM mode.
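The RAM mode described above can be emulated by storing a distinct address in one plane and using it as the search key. In this Python sketch the names (ram_write, ram_read, address_plane, data_plane) are hypothetical and flat lists stand in for one row of memory modules.

```python
def ram_write(address_plane, data_plane, location, value):
    """Write value into the single module whose stored address matches."""
    for module, stored in enumerate(address_plane):
        if stored == location:
            data_plane[module] = value


def ram_read(address_plane, data_plane, location):
    """Read the single module whose stored address matches location."""
    matches = [data_plane[module]
               for module, stored in enumerate(address_plane)
               if stored == location]
    if len(matches) != 1:
        raise ValueError("addresses stored in the plane must be unique")
    return matches[0]


if __name__ == "__main__":
    addresses = [0, 1, 2, 3]   # a different address in each module
    data = [0, 0, 0, 0]
    ram_write(addresses, data, location=2, value=42)
    print(ram_read(addresses, data, location=2))
```

The uniqueness of the stored addresses is what turns a content search into an ordinary positional access.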


2.2  The  processing  element  (PE) 

The processing element is a RISC SIMD processor that can take some
decisions. Each PE can be activated or deactivated independently. The PE is able to
compute a basic logical or arithmetic operation in O(1) time. It can communicate in
O(1) units of propagation time with its adjacent PEs or with its associated memory
modules. Furthermore, it can communicate in O(log n) units of propagation time
either with its non-adjacent PEs or its non-adjacent memory modules through the
interconnection network.

When  the  processor  is  connected  directly  to  its  memory  row,  the  access  to  the  data 
contained  in  this  line  is  accomplished  by  means  of  the  FIFO,  RAM  and  CAM  modes. 
In  the  FIFO  mode  the  data  of  the  PE  are  transferred  to  the  last  MAM  module  of  the 
row. 

Using the tree interconnection network, a processor can be connected to several memory
rows or even to all memory rows. This depends on the interconnection network
programming.  To  enable  the  communication  between  PEs,  each  PE  has  a  data  output 
toward  its  adjacent  PEs  (upper  PE,  lower  PE). 


2.3  Restricted  MIMD  mode 

A SIMD PE is characterized by its reduced size. But, because it does not have a control
unit, this type of PE cannot take internal decisions. Thus, to execute different
operations on different data, a SIMD architecture has to connect and disconnect the
processors as many times as the number of different operations. On the other hand,
MIMD processors, which have a control unit, can take internal decisions, but they are much
more complex, which limits the number of PEs in an integrated circuit. The LAPMAM
architecture has a SIMD processor that can take some internal decisions. This possibility
increases the flexibility of the architecture by avoiding the connection and disconnection
of PEs to perform different instructions, which reduces the computing time. We call this
particular characteristic restricted MIMD mode because the processor
can only take a few decisions.


2.4  The  interconnection  network 

The LAPMAM interconnection network performs the PE-PE or PE-MAMs communications
in O(log n), but in some cases they can be executed in O(1), a possibility that we
exploit in our algorithms. Moreover, this network has the characteristics of
modularity and extensibility, which allow the network to be constructed from a small
set of basic modules and to be extended to a larger size. These possibilities are very
interesting for a VLSI implementation.

The interconnection network is reconfigurable by n+(3n/4)-1 switch modules denoted
S. Each switch S contains (4 log 512 + 2) three-state buffers. The PE-PE and PE-MAMs
connections can be carried out in regions. Some PEs can be connected to a
region of 2, 4, 8, etc. elements (PEs or rows of MAMs). This connection allows certain
PEs to do a regional or global communication in O(1) with a propagation delay
of O(log n). But, in general, a global communication time of O(log n) is obtained
with this type of network. A tree interconnection network for an architecture with
eight PEs is presented in Figure 3.


Figure  3:  Tree  structure  interconnection  network  for  LAPMAM  architecture  with  8  PE  and 


3  Algorithms 

This section presents the efficient parallel algorithms used to evaluate the LAPMAM
architecture. The algorithms developed are connected components labeling, area and
perimeter of a region, histogramming and median filtering. These
algorithms are described for an n x n image in the following paragraphs.


3.1  The  connected  components  labeling 

Labeling consists in assigning a unique label to each connected component in the
image. It is a fundamental task in image processing and a lot of architectures and
algorithms have been created to solve this problem [5]. Our algorithm, which is based
on a divide and conquer technique, leads to a complexity of O(n log n). We remark
that this complexity is independent of the shape of the objects and the type of image, which
could be black and white or gray level. A 4-connectivity image is assumed. We
suppose that an initial image is available in MA[i,j] while the MB[i,j] plane is initialized
at 0. This algorithm comprises two stages, as follows.

Row processing: The values of MA[i,j] and MA[i+1,j] of a given row, starting with
i=0, are tested. If both are identical, MB[i+1,j] is assigned the MB[i,j] value.
Otherwise, MB[i+1,j] is assigned its own row-major index value. This operation is done in
parallel for each row and is repeated by incrementing the index i. At the end of
row processing (i=n), each object is labeled according to its smallest value. The value
in each row of MB represents the label of the corresponding pixel in the same row of MA.
The complexity for this stage is O(n), since each row is scanned once with O(1) work per pixel.

Figure 4: Labeling example (legend: modified values change simultaneously; regions; merging operation; processors)


Merging: The values of MA[i,j] and MA[i,j+1] of two given rows, starting with i=0,
are tested. If they are equal, the largest of the two corresponding MB values addresses
rows j and j+1 of MB in normal CAM mode to update all CAMs in the 2 rows with the
smallest value. This is called the broadcast mode and takes O(1) time. Otherwise, there
is no operation. The operation is carried out for each pair of rows in parallel and is
repeated by incrementing i until the 2 rows are merged (i=n). The 2 merged rows are
henceforth called a region. The merging is repeated by activating the following stage of
the tree structure to form 2 larger regions. Two adjacent boundaries of both regions are
then scanned and compared. The same procedure as above is undertaken. The largest and
smallest values are then transferred in the broadcast mode, but this time in order to merge
the 2 regions instead of the 2 rows. The merging is repeated until the last stage of the
tree structure. The total number of stages reaches log n and each merge takes n iterations,
so the total merging takes O(n log n) time. At the end of the
procedure, the background is defined: with a simple CAM operation all the objects
with the value chosen become the background and are labeled 0. Figure 4
shows a labeling example; it considers only one object for clarity. The
initial image is shown in the first frame, and the following two frames (left to right) present
the row processing. In this part there is a circular data movement in each row that is executed
by the FIFO instruction. Each data item passes through the PE to be processed until all the
data recover their original organization. The merging stage also uses the
FIFO function to provide the memory data to each PE. The data update is
done by the CAM function, as shown in all the merging frames. The three stages
(log 8) of merging are required to process an 8x8 image. The final result (labeled
image) is presented in the last frame.

The algorithmic description of our labeling method is presented in the following
paragraphs. The terminology used is:

M[i,j] : memory module in column i and row j.
M[*,j] : memory module in all columns and row j.
M(CAM) : memory accessed in CAM mode.

In the interactive CAM mode, writing MA[*,j] with a New_Data if MB[*,j] is addressed by
a Target_Data is shown as follows:

forall CAMs MB(CAM)[*, j], ( 0 ≤ j ≤ n-1 ) do_in_parallel
    if ( MB(CAM)[*, j] == Target_Data )
        MA(CAM)[*, j] = New_Data;
    endif
endforall

Algorithm: Region Labeling

Input: Initial image in MA[i,j].
Output: Labeled image in MB[i,j].

Initialization of memory:

forall Memory MB[i,j], ( 0 ≤ i ≤ n-1, 0 ≤ j ≤ n-1 ) do_in_parallel
    MB[i,j] = 0;

Row processing:

forall Processors P_j, ( 0 ≤ j ≤ n-1 ) do_in_parallel
    for ( i=0 ; i ≤ n-1 ; i++ )
        if ( MA[i,j] == MA[i-1,j] ) MB[i,j] = MB[i-1,j];
        else
            MB[i,j] = i+n×j+1;
            // i+n×j+1: row major initialization (1, 2, 3,.., n)
        endif
    endfor
endforall

Merging:

for ( f=1 ; f ≤ log n ; f++ )
    for ( i=0 ; i ≤ n-1 ; i++ )
        forall Processors P_L, ( L = 2^(f-1)(2k-1)-1, 1 ≤ k ≤ n/2^f ) do_in_parallel
            if ( MA[i,L] == MA[i,L+1] )
                if ( MB[i,L] < MB[i,L+1] )
                    Target_Data_L = MB[i,L+1];
                    New_Data_L = MB[i,L];
                else
                    Target_Data_L = MB[i,L];
                    New_Data_L = MB[i,L+1];
                endif
                forall CAMs MB(CAM)[*, r+2^f(k-1)],
                       ( 0 ≤ r ≤ 2^f-1, 1 ≤ k ≤ n/2^f ) do_in_parallel
                    if ( MB(CAM)[*, r+2^f(k-1)] == Target_Data_L )
                        MB(CAM)[*, r+2^f(k-1)] = New_Data_L;
                    endif
                endforall
            endif
        endforall
    endfor
endfor

Background definition:

forall CAMs MB(CAM)[*, j], ( 0 ≤ j ≤ n-1 ) do_in_parallel
    if ( MB(CAM)[*, j] == 0 )

    endif
endforall
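The row processing and merging stages can be checked with a sequential emulation. The Python sketch below is an illustration under several assumptions: n is a power of two, the image is stored as image[row][column], the guard i > 0 replaces the undefined access to column -1, and the background step (labeling as 0 every pixel whose image value equals a chosen background value) is our reading of the text rather than a transcription of the listing. The function name label_regions is hypothetical.

```python
def label_regions(image, background=0):
    """Label 4-connected regions of equal value in an n x n image."""
    n = len(image)
    stages = n.bit_length() - 1          # log2(n) merging stages
    labels = [[0] * n for _ in range(n)]

    # Row processing: the PE of row j scans its row from left to right.
    for j in range(n):
        for i in range(n):
            if i > 0 and image[j][i] == image[j][i - 1]:
                labels[j][i] = labels[j][i - 1]
            else:
                labels[j][i] = i + n * j + 1   # row-major index

    # Merging: at stage f, regions of 2**(f-1) rows are merged pairwise.
    for f in range(1, stages + 1):
        size = 2 ** f
        for i in range(n):
            for k in range(1, n // size + 1):
                low = 2 ** (f - 1) * (2 * k - 1) - 1   # row above the boundary
                if image[low][i] != image[low + 1][i]:
                    continue
                target = max(labels[low][i], labels[low + 1][i])
                new = min(labels[low][i], labels[low + 1][i])
                if target == new:
                    continue
                # Broadcast (CAM write) over every row of the merged region.
                first = size * (k - 1)
                for row in range(first, first + size):
                    labels[row] = [new if v == target else v
                                   for v in labels[row]]

    # Background definition (assumed): background pixels get label 0.
    for j in range(n):
        for i in range(n):
            if image[j][i] == background:
                labels[j][i] = 0
    return labels


if __name__ == "__main__":
    picture = [[1, 1, 0, 0],
               [0, 1, 0, 2],
               [0, 1, 0, 2],
               [0, 0, 0, 2]]
    for line in label_regions(picture):
        print(line)
```

Each object ends up with the smallest row-major index among its pixels, which matches the labeling rule stated for the row processing stage.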


3.2  Area  or  perimeter  determination 

These two algorithms are very similar to the preceding one. They also use the divide-and-conquer
technique and are carried out in the same two phases. The only difference
is the type of processing applied to the pixels. For the area determination, all
the pixels with a given label are counted, while only those situated on the boundary of
the connected components are taken into account in the case of perimeter determination.
Here again, these algorithms have a complexity of O(n log n).


3.3  Histogramming 

The histogram of an image is defined as the total number of pixels belonging to
each gray-level value. For the histogram determination we use the result
organization technique proposed by Alnuweri [7]. In this algorithm, an initial
image is supposed to be available in the A1 plane. The B1 plane, which is
initialized to 0, is used to store the result of histogramming: its columns
correspond to the gray-level values while its rows correspond to the number of
pixels.

Row processing: Here, at each iteration i, each PE reads a gray-level value P in
the i-th column of the A1 plane. The value Q, in the P-th column of the B1
plane, is then incremented. Since each incrementing operation takes O(1) time
and there are n iterations, this phase takes O(n) time.

Sum-on-tree: Here, at each iteration i, each PE reads the value of the i-th
column in the B1 plane. The sum-on-tree operation is employed to add all values
stored in the PEs. Since each sum-on-tree operation takes O(log n) time and
there are O(n) iterations, this phase takes O(n log n) time.

Hence, the complexity of our histogramming is O(n log n), which is optimal for
G < n, where G is the number of gray-level values.
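
The two phases can be illustrated with a short sequential Python sketch. It is a simulation under stated assumptions, not the LAPMAM implementation: each image row plays the role of one PE, the local counters stand in for the B1 plane, and the pairwise reduction in tree_sum stands in for the sum-on-tree operation. The names histogram_by_rows and tree_sum are hypothetical, and a non-empty image with integer gray levels in range(levels) is assumed.

```python
def tree_sum(values):
    """Add values pairwise, level by level, as a binary reduction tree would."""
    values = list(values)
    while len(values) > 1:
        paired = [values[k] + values[k + 1] for k in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # odd element is carried to the next level
            paired.append(values[-1])
        values = paired
    return values[0]


def histogram_by_rows(image, levels):
    """Histogram of a non-empty image whose pixels lie in range(levels)."""
    # Row processing: the PE of row j increments its local counter for each
    # gray level P it reads.
    local = [[0] * levels for _ in image]
    for j, row in enumerate(image):
        for p in row:
            local[j][p] += 1
    # Sum-on-tree: for each gray level, add the local counters of all PEs.
    return [tree_sum(local[j][g] for j in range(len(image)))
            for g in range(levels)]
```

For the 2 x 3 image [[0, 1, 1], [2, 1, 0]] with three gray levels, the sketch returns [2, 3, 1].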


3.4  Median  filtering 

Median filtering consists of replacing each pixel of a given image by the median
of the pixels contained in a window centered around that pixel [8]. This
filtering operation is useful in removing isolated lines or pixels while
preserving spatial resolution. The classic method consists of sorting the
elements from the smallest to the largest in a value table. The 5th element, in
the case of 9 elements, will be the median value. To sort them, elements are
compared two by two and swapped in the table when necessary. In this method, all
the window pixels are accumulated in the PE registers. The PE uses other
registers to execute the sorting operations and to store the results. We propose
a fast 3x3 median filter algorithm that uses only a few PE registers thanks to
the interprocessor communication. This algorithm can be extended to larger
windows. The pixels of the window are distributed as shown in Figure 5.

The algorithm consists of three steps. In the first step, the 3 data groups
stored in the PEs are sorted in parallel from the smallest to the largest. In
the second step, the largest pixel among the smallest ones of each row is found.
In the same way, the smallest pixel among the largest ones of each row is found,
and the median of the group of medians is detected too. The final step consists
of detecting the median value of the diagonal formed by the elements found in
the second step; it is the final median value of this group of pixels. Figure 5
shows the pixel array and the three steps of the method.

The relevant characteristic of this method is that the classification of the
three elements obtained by a PE in the first step is also used by its two
immediate PE neighbors. The median filter operations of this step are therefore
divided by three. The second step requires only a few operations because it is
not necessary to sort the whole column.

It is enough to find the smallest, largest and median value in the corresponding
columns. The same holds for the third step: the final median value is detected
with 4 comparisons. In conclusion, the LAPMAM system performs the 3x3 median
filtering in O(n) steps, like many other linear architectures, but our algorithm
processes each pixel in only sixteen operations.
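
The three steps can be checked with a small Python sketch for a single window. It is an illustration only: it assumes that the three data groups are the columns of the 3x3 window, it uses library sorting where the PEs would use a few compare operations, and the name median3x3 is hypothetical. It relies on the known identity that the median of nine values equals the median of (the largest of the three group minima, the median of the three group medians, the smallest of the three group maxima).

```python
def median3x3(window):
    """Median of a 3x3 window given as three rows of three values."""
    # Step 1: sort each group of three pixels (here: each column).
    groups = [sorted(col) for col in zip(*window)]
    lows = [g[0] for g in groups]
    mids = [g[1] for g in groups]
    highs = [g[2] for g in groups]

    # Step 2: largest of the smallest, median of the medians,
    # smallest of the largest.
    largest_low = max(lows)
    median_mid = sorted(mids)[1]
    smallest_high = min(highs)

    # Step 3: the median of these three values is the window median.
    return sorted([largest_low, median_mid, smallest_high])[1]
```

For the window [[5, 3, 8], [1, 9, 2], [7, 4, 6]], which holds the values 1 to 9, the sketch returns 5.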
Figure 5: Pixel classification window used to find the 3x3 median value.


4  Performance 

To validate these algorithms we carried out a hardware simulation, using VLSI
tools, of a LAPMAM prototype with 8 PEs. All the algorithms mentioned above were
implemented on this prototype and simulated at a clock frequency of 50 MHz. The
results were then extended to a 512-PE architecture; they are shown in Table 1.
LAPMAM computes these low and intermediate-level image processing algorithms
much faster than the video rate. The best performance results of the DARPA II
image understanding benchmark [9] for the algorithms evaluated are compared in
the first part of Table 2. The architectures included are the Connection Machine
(CM) with 64 K PEs, the Associative String Processor (ASP), which has 262,144
processors, and the Image Understanding Architecture (IUA), which consists of
three different processor levels: low level SIMD PEs (processor-per-pixel), 4096
intermediate level SIMD/MIMD 16-bit processors, and one high level
multiprocessor. For the tasks compared, our architecture is among the best ones
while being the least complex. On this benchmark, only IUA has better results
for labeling; however, it has many more processors than our architecture.
Otherwise, LAPMAM has the best computation times. This does not necessarily mean
that our architecture is much better than the others, since these architectures
are very different and the evolution of technology is not taken into account.
Nevertheless, it gives a good idea of LAPMAM's potential in low and intermediate
level tasks.
Table 1: LAPMAM estimated performance for a 512x512 image

Algorithm               Complexity    Clock cycles    Time (µs)
Labeling                O(n log n)    30737           614.74
Area of a region        O(n log n)    43548           870.96
Perimeter of a region   O(n log n)    46091           921.82
Histogram               O(n log n)    21015           420.3
Median filter           O(n)          10241           204.82
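
The two numeric columns of Table 1 are consistent with a cycle count and an execution time in microseconds at the 50 MHz simulation frequency. The short Python check below makes that arithmetic explicit; reading the first numeric column as clock cycles is an assumption drawn from this consistency, and the function name is hypothetical.

```python
def cycles_to_microseconds(cycles, clock_hz=50e6):
    """Convert a clock-cycle count to microseconds at the given frequency."""
    return cycles / clock_hz * 1e6


# Cycle counts from Table 1, with the times listed next to them.
table1 = {
    "Labeling": (30737, 614.74),
    "Area of a region": (43548, 870.96),
    "Perimeter of a region": (46091, 921.82),
    "Histogram": (21015, 420.3),
    "Median filter": (10241, 204.82),
}
for name, (cycles, listed_us) in table1.items():
    assert abs(cycles_to_microseconds(cycles) - listed_us) < 1e-6, name
```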

In the second part of Table 2, the LAPMAM estimated performance is compared
with architectures that are more similar to LAPMAM: VIP [10], SliM-II [10] and
IMAP-VISION [11]. In this comparison, our architecture has the best results for
these algorithms. Its enhanced parallelism allows the algorithm complexities to
be reduced. The use of CAM and the tree structure of switches in the
interconnection network make LAPMAM extremely efficient for connected component
analysis and median filtering tasks. However, because of the MAM modules, the
architecture is more complex than those that use RAM. LAPMAM thus requires a
full custom approach for its hardware implementation.


Table 2: LAPMAM estimated times compared with other architectures (time in ms)

                DARPA II Benchmark results    Architectures similar to LAPMAM
                for a 512x512 image
Algorithm       CM      ASP     IUA           VIP     SliM-II  IMAP-VISION   LAPMAM
Labeling        100     22.8    0.0596        -       -        19.5*         0.614
Median filter   15      0.72    0.5625        3.672   2.525    1.07          0.204
Histogram       -       -       -             -       3.313    1.33          0.420

Configurations: CM, 64 K PEs; VIP, 1024 PEs, 50 MHz; SliM-II, 512 PEs, 40 MHz;
IMAP-VISION, 512 PEs, 40 MHz, 256x240 image; LAPMAM, 50 MHz, 512 PEs, 512x512
image.

* Worst-case example


5 Conclusion


Fast parallel algorithms for labeling, area, perimeter, histogramming and 3x3
median filtering have been developed for a new parallel architecture dedicated
to image processing. The quasi-optimal processor x time complexity of these
algorithms and the efficient utilization of the MAM demonstrate the interest of
this architecture for low and intermediate level vision, particularly for
connected component analysis and median filtering. The use of a tree structure
of switches has proved to be an excellent solution for reducing the data
propagation time in the interconnection network. Considering the results of the
algorithms, the system offers very good performance for real time image
processing. This will be confirmed with the development of other algorithms and
the hardware implementation of the system. Other algorithms and a VLSI LAPMAM
prototype are under development at the moment.

Abstract. The most demanding image processing applications require
real time processing, often using special purpose hardware. The work
presented here concerns the application of cluster computing to off-line
image processing, where the end user benefits from the operation
of otherwise idle processors in the local LAN. The virtual parallel
computer is composed of off-the-shelf personal computers connected by a
low cost network, such as a 10 Mbits/s Ethernet. The aim is to minimise
the processing time of a high level image processing package. The system
developed to manage the parallel execution is described and results
obtained for the parallelisation of high level image processing algorithms
are discussed, namely for active contour and modal analysis methods, which
require the computation of the eigenvectors of a symmetric matrix.


1  Introduction 

Image  processing  applications  are  computationally  demanding  due  to  the  amount 
of  data  to  be  processed,  to  the  response  time  required,  or  due  to  the  complexity 
of  the  image  processing  algorithms.  A  wide  range  of  hardware  has  been  used  for 
image  processing.  For  low  level  image  analysis,  where  each  processor  performs 
a  uniform  set  of  operations  based  on  the  image  data  matrix  in  a  fixed  amount 
of  time,  SIMD  computers  using  data  parallelism  may  be  used;  in  [28]  a  special 
purpose  SIMD  computer  with  1024  processors  was  presented.  Systolic  Arrays 
[11]  which  can  exploit  the  regular  and  constant-time  operations  of  an  algorithm 
are  also  a  possible  option. 

For high level image processing, e.g. pattern recognition, where each
processor is assigned an independent operation, MIMD supercomputers commonly
used in simulation have been used [3]. For real time vision applications special
MIMD computers were developed, e.g. ASSET-2, based on PowerPC processors
for computation and on Transputers for communication [29]. MIMD supercomputers
were characterised by allowing a diversity of structures; however,
technological factors have been forcing a convergence towards systems formed by
a collection of essentially complete computers connected by a communication
network [9]. The processors of these computers have become the same ones used in
workstations. Therefore, the idea of forming a parallel computer from a collection
of off-the-shelf computers comes naturally, and fast communication techniques
were also developed for that purpose [25]. Several cluster computing systems
have been developed, e.g. the NOW project [2].

Our aim is not to build a cluster of personal computers for parallel
processing but to perform parallel processing on already existing group clusters,
where each node is a desktop computer running the Windows operating system. These
clusters are characterised by having a low cost interconnection network, such as
a 10 Mbits/s Ethernet, connecting different types of processors, of variable
processing capacity and amount of memory, thus forming a heterogeneous parallel
virtual computer. Due to network restrictions, which do not allow simultaneous
communication among several nodes, the application domain is restricted to one
or two dozen processors.

The  motivation  for  a  parallel  implementation  of  image  algorithms  comes  from 
image  and  image  sequence  analysis  needs  posed  by  various  application  domains, 
which  are  becoming  increasingly  more  demanding  in  terms  of  the  detail  and 
variety  of  the  expected  analytic  results,  requiring  the  use  of  more  sophisticated 
image  and  object  models  (e.g.,  physically-based  deformable  models),  and  of  more 
complex  algorithms,  while  the  timing  constraints  are  kept  very  stringent. 

A promising approach to deal with the above requirements consists in
developing parallel software to be executed, in a distributed manner, by the
machines available in an existing computer network, taking advantage of the
well-known fact that many of the computers are often idle for long periods of
time [20]. It is quite common in many organisations that a standard network
connects several general purpose workstations and personal computers, accumulating
a very substantial computing power that, through the use of appropriate
managing software, could be put at the service of the more computationally
demanding applications.

Existing software, such as the Windows Parallel Virtual Machine (WPVM)
[1], allows building parallel virtual computers by integrating in a common
processing environment a set of distinct machines (nodes) connected to the network.
Although the parallel virtual computer nodes and the underlying communication
network were not designed for optimised parallel operation, very significant
performance gains can be attained if the parallel application software is conceived
for that specific environment.


2  Image  Algorithms  and  Systems 

The image algorithms that have been parallelised consist of a set of low level
image processing operations, namely edge detection [27,6], distance transform,
convolution mask, histogramming and thresholding, whose suitability to the
cluster architecture was analysed in [4]. A set of linear algebra algorithms
required for high level image processing was also implemented. The algorithms are
the matrix product [14], LU factorisation [7], tridiagonal reduction [8],
symmetric QR iteration [15], matrix inversion [23] and matrix correlation.
In this paper, results focus on high level image processing algorithms, namely
active contours [19] and modal analysis [26].

Some  image  processing  systems  have  been  proposed  to  run  on  a  cluster  of 
personal  computers.  In  [17]  two  highly  demanding  vision  algorithms  were  tested 
giving  superlinear  speedup,  due  to  memory  pagination  on  one  workstation.  The 
machines formed a homogeneous computer. In [18] a high level interface parallel
image  processing  library  is  presented  and  results  for  low  level  image  operations 
on  an  Ethernet  network  of  HP9000/715  workstations  and  an  ATM  network  of 
SGI  workstations  are  reported.  In  [21]  a  machine  independent  methodology  was 
proposed  for  homogeneous  computers;  results  were  presented  separately  for  two 
SMP  workstations  with  two  and  eight  processors,  not  requiring  communication 
between  machines. 

Our implementation differs from the ones mentioned above in that it
considers a general bus type heterogeneous cluster, where data is distributed
in order to obtain a correct load balancing and the number of processors that
participate in a distributed algorithm varies dynamically in order to minimise
the processing time of each operation [5].

3  The  System  Architecture 

The  computers  that  belong  to  the  virtual  machine  run  a  process  to  monitor 
the  percentage  of  processor  time  spent  with  the  local  user.  Conceptually,  local 
users  have  priority  over  the  distributed  application  and  the  computer  will  not 
be  available  if  the  mean  local  user  time  is  above  a  minimum  threshold  during  a 
specified  period  of  time,  e.g.  5  seconds. 

Each algorithm or task is decomposed until indivisible operations are
obtained for which parallel code exists. When a parallel algorithm is launched
the master process schedules work to the processors of the virtual machine
according to their availability, choosing the number of processors that minimises
the processing time of individual operations and allowing data redistribution if
the optimal grid [4] of processors changes from operation to operation.

As an example, the algorithm to extract the contour of an object can be
decomposed into edge enhancement, thresholding and contour tracking operations.


3.1  Hardware  Organisation  and  Computational  Model 

The hardware organisation is shown in figure 1. Each node of the virtual
machine is a personal computer under the Windows NT operating system, running
WPVM software to communicate. The interconnection network is an Ethernet
at 10/100 Mbits/s.

Several  computational  models  [9,30,16]  were  proposed  in  order  to  estimate 
the  processing  time  of  a  parallel  program  in  a  distributed  memory  machine. 
Although  they  could  be  adapted  for  the  cluster  of  personal  computers,  a  specific 
and  simplified  model  is  presented  below. 
Fig. 1. Hardware organisation


Each node of the machine is characterised by its processor capacity S_i,
measured in Mflop. The network is characterised by allowing only one message
to be broadcast at a given time, by the latency time (T_L) and by the bandwidth
(LB). The time to send a message (T_Comm) composed of nb bytes is given by:

\[ T_{Comm} = T_L K + \frac{nb}{LB}, \qquad K = \frac{nb}{packetsize} \qquad (1) \]

The value K multiplies T_L because each message is partitioned into packets of
length 46 to 1500 bytes (packetsize), and a latency time is incurred for each
packet. 1024 is a typical packet size.
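
Equation 1 can be evaluated with a few lines of Python. This is a sketch under stated assumptions rather than the authors' code: the function and parameter names are hypothetical, the latency is taken in seconds and the bandwidth in bytes per second, and K is rounded up to a whole number of packets, which the text implies (one latency per packet) but equation 1 does not state explicitly.

```python
import math


def comm_time(nb, latency, bandwidth, packetsize=1024):
    """Estimated time in seconds to send a message of nb bytes.

    latency    -- per-packet latency T_L in seconds
    bandwidth  -- network bandwidth LB in bytes per second
    packetsize -- packet size in bytes (1024 is the typical value quoted)
    """
    k = math.ceil(nb / packetsize)  # assumption: whole packets only
    return latency * k + nb / bandwidth
```

For example, with a hypothetical latency of 1 ms on a 10 Mbits/s Ethernet (1.25e6 bytes/s), a 2048-byte message needs two packets and takes 2 x 0.001 + 2048 / 1.25e6 seconds, about 3.64 ms.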

The parallel component TP of the computational model, equation 2, represents
the operations that can be divided over a set of p processors obtaining a
speedup of p, i.e. operations without any sequential part.

\[ TP(n, p) = \frac{\psi(n)}{\sum_{i=1}^{p} S_i} \qquad (2) \]

The numerator ψ(n) is the cost function of the algorithm measured in floating
point operations (flop) as a function of the problem size n. For example, to
multiply square matrices of size n, the cost is ψ(n) = 2n^3 [10].
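
As a worked illustration of equation 2, the Python sketch below estimates the parallel component for a matrix product. The function names are hypothetical, and the processor capacities are assumed to be rates in Mflop per second so that the result is a time in seconds.

```python
def matmul_flop(n):
    """Cost function of the square matrix product: 2 n^3 flop."""
    return 2 * n ** 3


def parallel_time(flop, capacities_mflops):
    """Parallel component TP: total flop divided by the summed capacity.

    capacities_mflops -- capacities S_i of the p processors used, assumed
                         to be in Mflop per second
    """
    return flop / (sum(capacities_mflops) * 1e6)
```

For six processors of 141 Mflop each, multiplying two matrices of size 1000 gives 2e9 / 846e6, roughly 2.36 s, before any communication time is added.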


3.2  Software  Organisation 

Each operation is represented by an object containing the parallel and serial
implementation of the code, since the system can schedule a sequential
execution remotely if it is advantageous. The object associated to the operation
also contains the computational complexity and the amount of data that must be
exchanged in order to complete the operation. Based on these parameters the
system determines the number of processors, and which ones, that minimise the
operation processing time.

Each data instance to be processed, an image or a matrix, is represented by
an object responsible for accessing data items correctly according to the data
distribution information.

Data distribution is represented by independent objects with functions to
locate any item of data and to translate global to local indexes and vice-versa.
Each object can be shared by more than one data instance. Figure 2 shows the
software organisation.
[Figure 2 panels: Operations to be executed; Data instances; Data Distribution
Objects]

Fig. 2. Software organisation


The user describes a macro of sequential operations to be executed, referring
to the data instances to be processed. The system executes each operation in
parallel, determining for each one the number of processors to be used in order
to minimise the processing time. The data distribution suitable for each
operation is encoded in the operation code.


Input i1 image1.bmp
Shencastan i1 i2 i3 0
Histogram i2 outfile.txt i4
Output i2
Output i4

Fig.  3.  Macro  describing  the  operations  to  be  executed 


Figure 3 shows an example of a macro. An edge detector [27] is applied to the
input file i1, and the operator outputs, the magnitude and direction, are stored
in i2 and i3 respectively. The histogram is then computed and displayed as an
image, and is also saved in a text file.


3.3  Data  Distribution  and  Load  Balancing 

Different  strategies  are  applied  to  images  and  matrices.  Images  are  partitioned 
in  blocks  of  contiguous  rows  or  columns  and  the  blocks  are  assigned  to  each 
process  [4].  This  distribution  is  suitable  for  data  independent  image  operators. 
The  matrices  are  organised  in  square  blocks  of  data  and  a  heterogeneous  adapted 
version  [5]  of  the  block  cyclic  domain  distribution  [13]  is  used  to  assign  them  to 
the  processor  grid. 

A balanced distribution is achieved by a static load distribution made prior to
the execution of the parallel operation. To achieve a balanced distribution in the
heterogeneous machine the relative amount of data assigned to each processor,
l_i, is a function of its processing capacity compared to the entire machine:

\[ l_i = \frac{S_i}{\sum_{k=1}^{p} S_k} \qquad (3) \]
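
A minimal Python sketch of equation 3 applied to the row-block distribution of an image is given below. It is an illustration under assumptions, not the scheduler described in [5]: the function names are hypothetical, and leftover rows are assigned by largest fractional remainder, a rounding rule chosen here only to make the example complete.

```python
def load_shares(capacities):
    """Relative load l_i = S_i / sum of all S_k, as in equation 3."""
    total = sum(capacities)
    return [s / total for s in capacities]


def partition_rows(n_rows, capacities):
    """Split n_rows into contiguous (start, stop) blocks proportional to l_i."""
    exact = [n_rows * share for share in load_shares(capacities)]
    counts = [int(x) for x in exact]
    # Assumption: rows lost to truncation go to the largest remainders.
    leftover = n_rows - sum(counts)
    by_remainder = sorted(range(len(counts)),
                          key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    blocks, start = [], 0
    for count in counts:
        blocks.append((start, start + count))
        start += count
    return blocks
```

With hypothetical capacities of 200, 100 and 100 Mflop, an image of 8 rows is split into the blocks (0, 4), (4, 6) and (6, 8).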
For matrices, due to block indivisibility it is not always possible to ensure an
optimal load balancing; however, the scheduler computes the optimal solution
for a given network [5]. The processor placement on the grid is also done in order
to achieve a balanced distribution.

4  Parallel  Implementation  of  the  Active  Contour 
Algorithm 

An  active  contour  is  defined  as  an  energy  minimising  curve  subjected  to  the 
action  of  internal  forces  and  influenced  by  image  forces  which  move  the  contour 
to  the  relevant  features  in  the  image  such  as  lines  and  edges  [19]. 

Active contours can be used in a diversity of feature extraction operations
in images, such as detection of lines and edges, detection of subjective contours,
track analysis in a sequence of images or correspondence analysis in stereo
images.


[Figure 4 panels: Detected edges; Distance Transform; Contour detection]

Fig.  4.  Application  of  the  active  contour  algorithm  in  an  angiocardiographic  image 


Figure  4  (rightmost  image)  shows  the  contour  detection  over  the  original 
image  of  64  KB.  From  an  initial  position  (arbitrary  or  interactively  defined), 
using  an  iterative  process,  the  contour  moves  in  order  to  minimise  its  energy.  The 
final  position  corresponds  to  a  local  minimum  of  the  defined  energy  function.  In 
this  position,  the  forces  applied  to  the  contour  are  mutually  cancelled,  such  that 
the  contour  does  not  move.  The  energy  function  was  computed  based  on  the  edge 
detection  map  (leftmost  image)  and  the  distance  transform  map  (middle  image). 
The  quality  of  the  detection  depends  on  these  two  images.  Different  energy 
functions  can  be  used  [24],  however,  not  all  are  suitable  for  every  application. 

The  contour  points  distant  from  the  edges  are  pushed  in  their  direction  by 
the  distance  transform.  The  points  near  edges  are  influenced  by  the  edge  map 
energy  which  locally  refines  the  detection. 

Figure  5  shows  the  tasks  required  to  apply  the  active  contour  algorithm. 
The  computation  methodology  is  to  sequentially  execute  each  parallelised  task, 
choosing  the  processors  grid  that  minimises  the  individual  processing  time  and 
consequently  the  overall  time. 
Fig. 5. Active contour algorithm decomposed in indivisible tasks


The  image  operators  have  been  discussed  in  another  paper  [4].  Therefore, 
only  the  parallelisation  of  the  LU  factorisation  routine  is  considered  here. 


4.1  LU  Factorisation  Algorithm 

The LU factorisation algorithm is applied in order to solve directly the
system of equations resulting from the active contour internal forces: elasticity
and flexibility. The implementation follows the right-looking variant of the
algorithm proposed in [12]. However, adaptations were made at the load
distribution level in order to obtain a balanced load for heterogeneous machines.
Figure 6 (left) shows the load distribution obtained in a heterogeneous virtual
machine.
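
For reference, the right-looking order of operations can be sketched sequentially in Python. This is a generic textbook form given for illustration only, not the parallel block implementation of [12]: it is unblocked, does no pivoting (a zero pivot raises an error), and the function name is hypothetical.

```python
def lu_right_looking(matrix):
    """Return (L, U) with L unit lower triangular and L U equal to matrix.

    Unblocked right-looking variant without pivoting: at step k the column
    of multipliers is computed first, then the whole trailing submatrix to
    the right and below is updated.
    """
    n = len(matrix)
    a = [[float(v) for v in row] for row in matrix]
    for k in range(n):
        pivot = a[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot: this sketch does no pivoting")
        for i in range(k + 1, n):
            a[i][k] /= pivot                   # multipliers: column k of L
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]   # update trailing submatrix
    lower = [[a[i][j] if j < i else (1.0 if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    upper = [[a[i][j] if j >= i else 0.0 for j in range(n)]
             for i in range(n)]
    return lower, upper
```

The trailing submatrix update is the part that a parallel implementation distributes over the processor grid, which is why the amount of work per processor shrinks as k grows.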


[Figure 6 panels: LU algorithm; QR algorithm]


Fig.  6.  LU  and  QR  load  distribution  for  a  matrix  size  of  1800  and  1600  respectively 
for  the  machine  M={244,  244,  161,  161,  60,  50,  49}  Mflops  processors 


For  processor  grids  (1,4)  and  (1,5)  a  very  good  load  balancing  is  achieved. 
For  the  other  grids  the  three  slower  processors  took  approximately  15%  less 
time  than  the  other  ones,  due  to  the  block  indivisibility.  The  algorithm  requires 
a  significant  number  of  communication  points  which  results  in  a  not  very  scalable 
algorithm  as  shown  in  figure  7  (left). 

The  scalability  analysis  was  made  in  a  homogeneous  machine  in  order  to 
reduce  the  influence  of  load  imbalances. 
[Figure 7 panels: LU, TRD and QR; Matrix Correlation]


Fig. 7. Isogranularity curves for a 6 processor homogeneous machine connected by a
10 Mbit/s Ethernet; 160 K elements for TRD, LU and QR and 250K for LU2


5  Parallel  Implementation  of  the  Modal  Matching 
Algorithm 

This high level image processing algorithm [26] is applied for the tracking of
deformable objects over a sequence of images. Figure 8 shows the application of
the algorithm. It is based on finite element analysis, requiring the computation
of eigenvectors of symmetric matrices. The aim is to obtain correspondences
between object points of image i and i+n. The algorithm is divided into eigenvector
computation and matrix correlation. The eigenvector computation is subdivided
into three operations: tridiagonalisation, computation of the corresponding
orthogonal matrix, and QR iteration. The parallelisation is then achieved by the
individual parallelisation of each operation. Data is redistributed if the
processor grid changes between operations.


[Figure 8 panels: Instant i; Instant i + 2; Matching]

Fig. 8. Application of the modal analysis algorithm to an image sequence of a
beating heart


5.1  Tridiagonal  Reduction  and  Orthogonal  Matrix  Computation 

Tridiagonal reduction is the first algorithm applied to the symmetric matrix in
order to obtain the eigenvectors. The algorithm output is a tridiagonal matrix
T such that:

\[ A = Q^T T Q \qquad (4) \]

The matrix T replaces A in memory. As shown in figure 9 the best grid is a
row of processors. Details of the algorithm can be found in [8].

The matrix elements of T, apart from the tridiagonal positions, store the data
required for the second step of the eigenvector algorithm, i.e. the computation
of Q.

If the order of computation of the tridiagonal reduction was followed, an
O(n^4) algorithm would be obtained, corresponding to a matrix by matrix
product in each step; n-2 steps for a matrix of size (n, n). However, the
computation can be efficiently organised as described in [22] for a sequential
algorithm, obtaining a scalable operation for the virtual machine. Figure 9 shows
that the best grid is a row of processors.

5.2  The  Symmetric  QR  Iteration 

The QR iteration is the last operation for the eigenvector computation. The aim
is to obtain from the tridiagonal matrix T a diagonal matrix Λ whose elements
are the eigenvalues of A:

\[ T = G^T \Lambda G \qquad (5) \]

The matrix G is then used to compute the eigenvectors Q' of A:

\[ Q' = Q G^T \qquad (6) \]

Matrix G^T is obtained by iterating and updating it with the Givens rotations
[15]. To obtain Q' a matrix by matrix product would be required. However, the
operations can be organised so as to update Q' in each iteration, avoiding the
last matrix product. Each update modifies only two columns of Q'. Based
on this fact a scalable operation was implemented by allowing the redistribution
of data. The optimal data distribution is blocks of rows, so that any given row
is completely allocated to a given processor, avoiding communications between
processors for the update of Q'. The parallelisation implemented keeps the O(n^2)
chase operation in one processor, which computes all rotations for an iteration
and distributes them over a column of processors. Then all processors update
their rows, the O(n^3) part, in parallel without communications. This strategy has
a huge impact on the scalability of the QR iteration, as shown by the
isogranularity curve in figure 7. A good load balancing is also achieved for a
heterogeneous machine, as shown in figure 6.
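
The reason why a row-block distribution needs no communication during the update can be seen in a small Python sketch. It is an illustration under assumptions: the function name and the sign convention of the rotation are hypothetical choices, and each list of rows stands for the block held by one processor.

```python
def rotate_columns(rows, i, k, c, s):
    """Apply one Givens rotation (c, s) to columns i and k of the given rows.

    Each row is updated from its own two entries only, so a processor that
    holds complete rows needs nothing from the other processors.
    """
    for row in rows:
        a, b = row[i], row[k]
        row[i] = c * a - s * b
        row[k] = s * a + c * b


# The same rotation applied to the whole matrix and to two row blocks
# processed independently gives identical results.
whole = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
blocks = [[row[:] for row in whole[:2]], [row[:] for row in whole[2:]]]
rotate_columns(whole, 0, 1, 0.6, 0.8)
for block in blocks:
    rotate_columns(block, 0, 1, 0.6, 0.8)
assert blocks[0] + blocks[1] == whole
```

Only the rotation parameters (c, s) have to be sent to the processors, which matches the scheme described above where one processor computes the rotations and distributes them.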

The ideal grid for QR iteration is the opposite (column vs. row) of the ones
for tridiagonal and orthogonal matrix computation. This is the reason for
considering indivisible operations and allowing redistribution of data between
them to adapt the parallel machine to each operation.

5.3  Matrix  Correlation 

After QR iteration has been computed for the objects in both images the
eigenvectors are ordered in decreasing order of magnitude of the corresponding
eigenvalue. The correlation operation measures the similarity between the
eigenvectors of both objects. The behaviour of the processing time function shown
in Figure 9 is different from the other operations. The best grid is either a row
or a column of processors. The parallel algorithm is scalable as shown in figure 7.


[Figure 9 panels: Tridiagonal reduction, Orthogonal matrix, Matrix correlation]

Fig. 9. Estimated processing time for a 6 processor homogeneous machine connected
by a 10 Mbit/s Ethernet


6  Results 


Results are presented for machine M1, composed of 6 homogeneous processors
of 141 Mflop each, M2={244, 244, 161, 161, 60, 50, 49} Mflop and M3={161, ...}
Mflop processors. M1 is connected by a 10 Mbit/s Ethernet, and
M2 and M3 by a 100 Mbit/s one. The performance metrics used to evaluate the
parallel application are, first, the runtime and, second, the speedup achieved. To
have a fair comparison in terms of speedup, one defines the Equivalent Machine
Number (EMN(p)), which considers the power available instead of the number
of machines, which is ambiguous information in a heterogeneous environment.
Equation 7 defines EMN(p) and the heterogeneous efficiency EH for p processors,
where Si is the computational capacity of processor i and S1 is that of the
processor that executed the serial code, also called the master processor.


$$\mathrm{EMN}(p) = \frac{\sum_{i=1}^{p} S_i}{S_1}, \qquad E_H = \frac{\mathrm{Speedup}}{\mathrm{EMN}(p)} \qquad (7)$$


For the machine M3, EMN(4) = 3.19, i.e. using 4 processors of the heterogeneous
machine is equivalent to 3.19 processors identical to the master processor
if it is the 161 Mflop one.
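
As a hedged illustration of Equation 7, the following Python sketch computes EMN(p) and the heterogeneous efficiency from a list of processor capacities. The function names and the choice of the first list entry as master are assumptions of this sketch; the capacities are the M1 and M2 values quoted above.

```python
def emn(capacities, master=0):
    """Equivalent Machine Number: total capacity divided by master capacity."""
    return sum(capacities) / capacities[master]


def heterogeneous_efficiency(t_serial, t_parallel, capacities, master=0):
    """Speedup divided by EMN(p), following Equation 7."""
    speedup = t_serial / t_parallel
    return speedup / emn(capacities, master)


M1 = [141] * 6                           # homogeneous machine
M2 = [244, 244, 161, 161, 60, 50, 49]    # heterogeneous machine

emn_m1 = emn(M1)          # for a homogeneous machine EMN(p) equals p
emn_m2 = emn(M2)          # master assumed to be the first 244 Mflop processor
```

For a homogeneous machine the definition reduces to the usual efficiency, speedup divided by the number of processors, which is why it allows a fair comparison between M1 and the heterogeneous machines.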

The right table of figure 10 presents results for the parallel active contour
algorithm in the M3 machine for an image of 64 KB (figure 4) and for a 256 KB
one (the left picture in figure 10). The time T1 represents the processing time
of the serial code in the master processor and TP the parallel processing time
in the virtual machine. The number of processors selected in each step of the
algorithm changes in order to minimise the processing time.

[Figure 10, left: skin tumor detection image. Right: table of active contour
results with columns including EMN and EH; surviving cell values: 4, 1, 4, 2, 1,
2.51, 2.19, 0.79]

Fig. 10. Application results of the active contour algorithm

Fig. 11. Eigenvector computation in the M2 machine


Results of the eigenvector computation are presented in figure 11 for
machine M2, due to the wide application of the algorithm. As shown, the
heterogeneous efficiency is near 80% for matrices with more than 1400² elements.
However, the first metric is processing time, which is reduced for matrices larger
than 400² elements.

To show the importance of the parallel processing system, results for the
modal analysis algorithm are presented for the homogeneous machine M1 (figure
12). The left chart compares the computation time of the virtual machine VM
when the optimal number of processors is selected, as indicated in the processing
results table, against the processing time when the same number of processors
is used for all stages of the algorithm. With a fixed number of processors, the
minimum time is obtained with 4 processors; however, it is still higher than the
time obtained for VM.
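
The comparison above can be sketched in Python. The stage names follow the operations discussed in this paper, but every timing value below is hypothetical and chosen only to illustrate the argument: selecting the processor count per stage can never be slower than the best single count used for all stages, because a sum of per-stage minima is at most the minimum of the sums.

```python
# Hypothetical stage times in seconds, indexed by number of processors.
stage_times = {
    "tridiagonal reduction": {1: 10.0, 2: 6.0, 3: 5.0, 4: 5.5, 5: 6.5, 6: 8.0},
    "orthogonal matrix":     {1: 12.0, 2: 7.0, 3: 5.5, 4: 5.0, 5: 5.5, 6: 6.5},
    "QR iteration":          {1: 20.0, 2: 11.0, 3: 8.0, 4: 6.5, 5: 5.5, 6: 5.0},
}


def best_fixed(times):
    """Best total time when one processor count is used for every stage."""
    counts = set.intersection(*(set(t) for t in times.values()))
    return min((sum(t[p] for t in times.values()), p) for p in counts)


def best_per_stage(times):
    """Best total time when each stage picks its own processor count."""
    choice = {name: min(t, key=t.get) for name, t in times.items()}
    total = sum(times[name][p] for name, p in choice.items())
    return total, choice


fixed_total, fixed_p = best_fixed(stage_times)
vm_total, vm_choice = best_per_stage(stage_times)
```

This sketch ignores the cost of redistributing data between stages, which a real selection would have to add to each per-stage estimate.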


Fig. 12. Modal analysis in the homogeneous machine M1


7 Conclusions

An operation-based parallel image processing system for a cluster of personal
computers was presented. The main objective is that the user of a computationally
demanding application may benefit from the computational power distributed
over the network, while keeping other active users undisturbed.

This goal can be achieved in a transparent manner for the user, once the
modules of his/her application are correctly parallelised for the target network
and the performance of the machines in the network is known. The application,
before initiating a parallel module, determines the best available computer
composition for a parallel virtual computer to execute it, and then launches the
module, achieving the best response time possible under the actual network
conditions.

Practical tests were conducted on both homogeneous and heterogeneous
networks. In both cases the theoretically optimal computer grid was confirmed by
the measured performance. A balanced load was achieved in both machines. The
machine scalability depends essentially on the communication requirements of
the operations. For QR iteration and matrix correlation the system is scalable;
however, it is not for the tridiagonal reduction.

Other generic modules will be parallelised and tested, so that an ever increasing
number of image analysis methods may be assembled from them. Application
domains other than image analysis may also benefit from the proposed
methodology.


Abstract

This paper analyzes the impact of hardware multithreading support on the
performance of distributed shared-memory (DSM) multiprocessors built out of
heterogeneous, single-chip computing nodes. Area-efficiency arguments motivate a
heterogeneous, hierarchical organization (HDSM) consisting of a few processors
with extensive support for instruction-level parallelism and large caches, and a
larger number of simpler processors with smaller caches for efficient execution of
thread-parallel code. Such a heterogeneous machine relies on the execution of
multiple threads per processor to deliver high performance for unmodified applications.
This paper quantitatively studies the performance of HDSMs for software-based
and hardware-multithreaded scenarios. The simulation-based experiments in this
paper consider a 16-node multiprocessor, six homogeneous shared-memory
benchmarks from the SPLASH-2 suite, and a decision-support application (C4.5).
Simulation results show that a hardware-based, block-multithreaded HDSM
configuration outperforms a software-multithreaded counterpart, on average, by 13%.


1  Introduction 

Continuing technological advances in VLSI manufacturing are predicted to bring
about billion-transistor chips in the next decade [15]. Such a large transistor
budget allows for the implementation of high-performance uniprocessors [12] that
aggressively exploit instruction-level parallelism (ILP), as well as chip-multiprocessors [8]
that can efficiently execute explicitly parallel tasks.

Large multiprocessor configurations of the future will be able to use such
high-performance components as commodity building blocks in their design.
Previous work [6] has shown that combining nodes of different processor and memory
characteristics into a heterogeneous distributed shared-memory (HDSM)
multiprocessor leads to area-efficient designs.

This work was partially funded by the National Science Foundation under grants
CCR-9970728 and EIA-9975275. Renato Figueiredo is also supported by a CAPES
scholarship. Candidate to the best student paper.

An HDSM combines a few high-performance processors and memories with a
larger number of simpler processors and smaller memories to form a hierarchical,
heterogeneous system [1] capable of fast execution of both sequential and
parallel codes. Figure 1 depicts the organization of an HDSM.


[Figure 1: three-level hierarchy, Level 1 (top), Level 2, Level 3 (bottom)]

Fig. 1. HDSM: processor-and-memory hierarchical organization. Processors and
memories are drawn such that processor performance and memory capacity are
proportional to their area in the figure.


The proposed heterogeneous DSM machines rely on the execution of multiple
threads per processor to deliver high performance for unmodified, homogeneous
applications. Previous work has studied the performance of HDSMs assuming a
software multi-tasking model based on voluntary context switches. This model is
valid for current commodity microprocessors that do not provide hardware
mechanisms to implement fine-grain multithreading. However, hardware multithreaded
microarchitectures are currently being used in commercial processors [16] and
considered in the implementation of future-generation high-performance
microprocessors [4].


This paper extends the performance studies of HDSMs reported in [6] by
quantitatively analyzing the impact of hardware multithreading on their
performance. This paper also complements previous work by employing a simulation
model that explicitly accounts for heterogeneity of processor performance due
to [...].

The analysis is performed via simulation of parallel benchmarks
from the SPLASH-2 suite [18] and of a hand-parallelized decision-support
application (C4.5 [13]). Benchmarks are simulated individually to study
single-program parallel speedup. All benchmarks are programmed with single-program,
multiple-data (SPMD) extensions to the C language.

A  modified  version  of  the  RSIM  [10]  multiprocessor  simulator  is  used  in  the 
experiments.  The  original  RSIM  simulator  models  DSM  machines  built  out  of 
homogeneous  ILP  processors,  with  no  hardware  support  for  multithreading.  It 
has  been  modified  for  the  performance  analysis  shown  in  this  paper  to  model 
heterogeneity  of  ILP  processors  and  caches,  and  to  model  hardware  support  for 
multithreading. 

This paper is organized as follows. Section 2 describes the heterogeneous
DSM machine model studied in this paper. Section 3 presents the experimental
methodology used in the performance study. Section 4 presents experimental
results and data analyses. Section 5 concludes this paper.

2  Machine  model 
2.1  Heterogeneous  DSMs 

HDSM machines differ from conventional distributed shared-memory
multiprocessors in that processors, memories and networks of HDSMs may be
heterogeneous. In this paper, processor heterogeneity is modeled in terms of degree of
support for ILP. Heterogeneity in the memory subsystem is modeled in terms of
L1 and L2 cache sizes and access times. Heterogeneity of the network subsystem
is not modeled in this paper.

The  heterogeneity  of  processors  and  caches  is  motivated  by  area/parallelism 
tradeoffs  in  the  design  of  future-generation  microprocessors:  the  system  consists 
of  a  combination  of  few,  aggressive  uniprocessors  with  large  caches  and  many 
simpler  processors  with  smaller  individual  caches.  The  former  processors  devote 
large  numbers  of  transistors  to  deliver  high  performance  for  sequential  codes, 
while  the  latter  processors  have  smaller  silicon  area  requirements  and  deliver 
high  performance  for  parallel  codes. 

The area/parallelism argument that motivates the design of HDSMs is based
on the use of area-efficient, simple processors for execution of parallel codes, and
aggressive uniprocessors for execution of sequential codes. For highly parallel
tasks, the high-performance uniprocessors can also be assigned to parallel
computation.

Previous work has shown that a software-based assignment of multiple
threads to the high-performance ILP uniprocessors of an HDSM yields performance
improvements for memory- and cpu-intensive programs [6]. Context switches in
software multi-tasking occur infrequently, and have large execution time
overheads. Such a coarse-grain model limits the potential for overlapping high-latency
shared-memory DSM operations.

Research on multi-threaded processors has shown that aggressive ILP
uniprocessors can be enhanced to support multiple threads with small increases
in chip area requirements [4]. The implementation of hardware multi-threading
extensions in the aggressive ILP processors of an HDSM can increase overall
system performance by increasing the potential for overlapping of
shared-memory accesses. To investigate the performance of such an enhanced system, the
high-performance processors of the HDSM machine modeled in this paper have
hardware support for block-multithreading.


2.2  HDSM  multiprocessor  configuration 

The HDSM multiprocessor under study consists of sixteen nodes. Each node
contains a single processor, L1 and L2 data caches, main memory and a remote
access device (RAD) with network interface and coherence controller. The nodes
are interconnected by a 2-D mesh. Cache coherence is maintained via a directory
controller that implements the MESI [11] protocol. The release consistency [7]
memory model is assumed in this study. Figure 2 depicts the machine model.


Fig. 2. HDSM model: each heterogeneous node has a single processor (P), two levels of
data cache (L1, L2), main memory (MEM) and a remote access device (RAD), all
connected by a memory bus. Nodes are interconnected via a mesh network.


Heterogeneity is present in both the processor and memory subsystems of
the HDSM machine. Processor heterogeneity is modeled in terms of the size
of hardware structures dedicated to ILP exploitation. The heterogeneous ILP
parameters investigated in this paper are issue rate, instruction window size,
number of arithmetic (ALU), floating-point (FPU) and address units, and
maximum number of outstanding cache misses (MSHRs [9]). Heterogeneity in the
memory subsystem is modeled in terms of the size and speed of caches.

The HDSMs under study have three levels, with 2, 4 and 10 nodes in levels
1, 2 and 3, respectively. The machine is configured as a processor-and-memory
hierarchy [1]: the number of processing elements increases from top to bottom
levels of the hierarchy, while cache sizes and the performance of processors and
cache memories decrease from top to bottom levels. Table 1 shows the processor
and memory configurations assumed for each machine level.

The inter-processor network is assumed to be homogeneous. This assumption
is conservative in accounting for inter-processor communication latencies.
Given the predicted integration level of next-generation microprocessors, it is
conceivable that HDSM levels built out of simple processors be integrated into
single-chip multiprocessors [8,6]. Such a configuration would allow smaller
intra-level latencies than those assumed in the machine model.

                                      level i=1   level i=2   level i=3
number of processors(i)                    2           4          10
issue width(i)                             8           4           1
instruction window size(i)               128          64           8
number of ALU/FPU/address units(i)         4           2           1
number of MSHRs(i)                        12           8           4
L1 cache size(i)                        32KB        16KB         8KB
L2 cache size(i)                         1MB       256KB        64KB
L2 cache miss detection latency(i)        10           5           3
L2 cache hit latency(i)                   25          13           8

Table 1. 3-level, 16-processor heterogeneous machine configuration. L2 cache miss
detection and hit latencies are shown in terms of clock cycles.


2.3  Heterogeneous  node  configurations 

The configuration of the level-3 processor is based on a simple out-of-order
microprocessor pipeline that issues one instruction per cycle. The level-2 configuration
is based on current high-performance, out-of-order microprocessor designs [5].
The high-performance level-1 processor is based on predicted configurations of
future-generation ILP microprocessors [3,14].

The cache sizes of the level-1 processor are dimensioned so that the L1 and
L2 data caches are large enough to hold the primary and secondary working
sets, respectively, of the SPLASH-2 benchmarks [18]. Cache sizes of lower-level
processors are scaled down (with respect to the adjacent upper level) by factors
of 2 (L1 cache) and 4 (L2 cache).
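
As a small consistency check, the configuration of Table 1 can be written down as data and the scaling rule verified. The dictionary layout and key names below are assumptions of this sketch, not part of the RSIM configuration format; the values are those of Table 1.

```python
# Table 1 as a plain dictionary; cache sizes are in KB, latencies in cycles.
LEVELS = {
    1: {"nodes": 2, "issue_width": 8, "window": 128, "units": 4, "mshrs": 12,
        "l1_kb": 32, "l2_kb": 1024, "l2_miss_cycles": 10, "l2_hit_cycles": 25},
    2: {"nodes": 4, "issue_width": 4, "window": 64, "units": 2, "mshrs": 8,
        "l1_kb": 16, "l2_kb": 256, "l2_miss_cycles": 5, "l2_hit_cycles": 13},
    3: {"nodes": 10, "issue_width": 1, "window": 8, "units": 1, "mshrs": 4,
        "l1_kb": 8, "l2_kb": 64, "l2_miss_cycles": 3, "l2_hit_cycles": 8},
}


def scaling_factors(key):
    """Ratio of a parameter between each level and the level below it."""
    return [LEVELS[i][key] / LEVELS[i + 1][key] for i in (1, 2)]


total_nodes = sum(level["nodes"] for level in LEVELS.values())
l1_factors = scaling_factors("l1_kb")   # expected: a factor of 2 per level
l2_factors = scaling_factors("l2_kb")   # expected: a factor of 4 per level
```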

The L1 cache access times are assumed to be a single processor cycle for
all processor configurations: it is assumed that clock cycles are the same for all
processors and that the level-1 caches are designed to match the pipeline clock.
The L2 cache tag and data access times are modeled after the analytical cache
access time model described in [17], assuming a 0.18 µm technology [14].

The remaining processor and memory simulation parameters are homogeneous
across HDSM nodes and are set to the default values of the original RSIM
simulator.


2.4  Programming  model 

This paper considers the execution of homogeneous parallel applications on
HDSMs. These programs are written in the single-program, multiple-data (SPMD)
model. The homogeneous programs are mapped onto heterogeneous
resources without source code modifications via static thread-to-processor
assignment schemes. The next subsection details the two assignment schemes studied
in this paper.


2.5  Multi-threading  model 

In this paper, two policies are considered in the assignment of threads to
heterogeneous processors. In the virtual-processor policy, both software and hardware
support for multithreading are studied.


1. Single-thread: one thread is assigned to each processor in the system.

2. Virtual-processor: in this scheme, a processor Pi is assigned VP(i) threads,
where VP(i) is the ratio between Pi's performance and that of the slowest processor
in the system. This ratio is obtained from the uniprocessor simulation results
summarized in Figure 3 (benchmarks that require a power-of-two number of
processors are assigned 5, 3 and 1 threads to processors in levels 1, 2 and 3,
respectively). There are two different multithreading scenarios studied under
this assignment policy:

(a) Software multithreading: in this scenario, thread context switches are
triggered only by failed synchronizations on locks and barriers. To
implement this switching criterion, the RSIM synchronization library has been
modified to include a voluntary context-switch call in the spin-waiting
loop of the synchronization operations. The software context-switching
overhead is modeled in the simulator by forcing the switching
processors to be idle for a configurable number of clock cycles. The context
switching overhead in this scenario is 800 processor cycles.

(b) Hardware multithreading: in this scenario, hardware support for
block multithreading [2] is available in the HDSM level-1 and level-2
processors. Thread context switches are triggered by the following criteria (in
addition to failed synchronization): when L2 cache misses occur, when
the number of cycles without any instruction graduation exceeds the
threshold Tgrad, and when the total number of cycles without any thread
context switch exceeds the threshold Tswitch. In this paper, Tgrad and
Tswitch are set to 20 and 10000 processor cycles, respectively. The context
switching overhead in this scenario is set to 4 processor cycles. In
addition, threads are guaranteed not to be context-switched for a minimum
run length of 4 cycles.
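
The two policies can be summarised in a short Python sketch. This is not code from the modified RSIM simulator: the class, field and function names are hypothetical, and the way the minimum run length interacts with the other criteria is an assumption of this sketch. The thresholds and node counts are the ones given in the text.

```python
from dataclasses import dataclass

NODES_PER_LEVEL = {1: 2, 2: 4, 3: 10}   # nodes in levels 1, 2 and 3

T_GRAD = 20            # cycles without instruction graduation
T_SWITCH = 10_000      # cycles without any context switch
MIN_RUN_LENGTH = 4     # cycles a thread is guaranteed to run
HW_SWITCH_OVERHEAD = 4     # cycles, hardware scenario
SW_SWITCH_OVERHEAD = 800   # cycles, software scenario


def total_threads(threads_per_level):
    """Number of threads in the system for a per-level assignment VP(i)."""
    return sum(NODES_PER_LEVEL[lvl] * n for lvl, n in threads_per_level.items())


@dataclass
class ThreadState:
    cycles_since_switch: int
    cycles_without_graduation: int
    l2_miss: bool
    failed_sync: bool


def should_switch_hw(state):
    """Hardware block-multithreading switch decision (assumed ordering)."""
    if state.cycles_since_switch < MIN_RUN_LENGTH:
        return False  # minimum run length not yet reached
    return (state.failed_sync
            or state.l2_miss
            or state.cycles_without_graduation > T_GRAD
            or state.cycles_since_switch > T_SWITCH)


def should_switch_sw(state):
    """Software scenario: only failed synchronizations trigger a switch."""
    return state.failed_sync


default_threads = total_threads({1: 4, 2: 2, 3: 1})
power_of_two_threads = total_threads({1: 5, 2: 3, 3: 1})
```

With 5, 3 and 1 threads per processor the system runs 32 threads, which is why that assignment is used for benchmarks that need a power-of-two thread count, while the 4, 2 and 1 assignment gives 26 threads.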


3  Experimental  methodology 

3.1  Benchmarks 


The set of benchmarks used in this paper includes six programs from the
SPLASH-2 suite and a parallelized version of the decision-support database
program C4.5.

The programs (and respective datasets) studied in this paper are C4.5 (adult
dataset with unknowns removed and a minimum node size of 100), FFT (16K
points), FMM (4096 particles), LU (256x256 matrix), Ocean (258x258 ocean),
Radix (512K integers) and Water (512 molecules). All benchmarks are compiled
with Sun Microsystems' WorkShop C compiler version 4.2 and optimization level
-xO4.


3.2  Simulation  environment 

The simulation environment is based on a modified version of the RSIM
simulator [10] that models a release-consistent DSM machine connected by a 2-D mesh,
with uniprocessor heterogeneous nodes with support for block-multithreading.

4  Experimental  results 

In this section, the performance of HDSMs is analyzed for the thread
assignment schemes described in Section 2.5. Initially, the relative performance of the
individual heterogeneous processors is discussed. Subsequently, the impact of
multithreading on HDSM performance is analyzed.


4.1  Impact  of  ILP  heterogeneity  on  single-node  performance 

Figure 3 shows the performance of the heterogeneous processors and caches in
terms of speedups with respect to a base (level-3) processor. The level-2 and
level-1 processors outperform the single-issue level-3 processor, on average, by
277% and 396%, respectively. Since clock speeds are assumed to be the same for
all processors, the performance differences between the heterogeneous processors
are due to instruction-level parallelism and cache sizes only.

Figure 3 shows that an eight-fold increase in issue rate and a sixteen-fold
increase in L2 cache yield an average four-fold performance improvement of the
level-1 processor over the simple level-3 processor. This result is consistent with
the area-efficiency analysis based on a case study of Alpha microprocessors
presented in [6]. The increase in chip area necessary to implement larger caches and
structures devoted to the extraction of ILP yields sub-linear gains in performance
under the assumption of the same fabrication technology (and clock cycle).

4.2  Parallel  speedup  analysis 

Figure 4 shows the speedups of the 16-node HDSM with respect to the base
(level-3) processor for the three different assignment scenarios described in
Section 2.5. In the virtual-processor assignment, 4, 2 and 1 threads are assigned to
level-1, level-2 and level-3 processors, respectively (except for benchmarks that
require a power-of-two number of processors, where 5, 3 and 1 threads are assigned
to processors of levels 1, 2 and 3). The simulation results show that the
virtual-processor assignment significantly outperforms the single-thread assignment
under both multithreading models. The average virtual-processor speedups over the
single-thread assignment are 28% and 45% for the software and hardware
multithreaded schemes, respectively.

Fig. 3. Simulated uniprocessor speedups (with respect to level-3 processor) of the
heterogeneous configurations shown in Table 1.

The hardware multithreading model outperforms the software model for almost
all benchmarks; the largest performance improvement is observed
for FFT, followed by C4.5 (19.6%), FMM (16.5%), Ocean (15.4%), LU
(...%) and Water (7.0%). For Radix, the hardware multithreading model
performs as well as the software model. These results can be explained with a
closer analysis of the execution time in the level-1 processor.

Figures 5, 6 and 7 show a breakdown of the execution time in one of the
level-1 processors into three components (busy, stalled on memory accesses, and
stalled on synchronization on locks and barriers) for the three assignment scenarios
of Figure 4.

In the single-thread case (Figure 5), the high-performance level-1 processor
spends most of its execution in synchronization points. Since this assignment
does not account for heterogeneity in processor performance, the level-1
processor is often waiting to synchronize with lower-level (slower) processors to proceed
with computation.

In the software multithreaded case (Figure 6), the level-1 processor spends less
time in synchronization relative to actual computation. The load-balancing
property of the virtual-processor scheme allows the level-1 processor to perform more
computation before attempting to synchronize with lower-level processors, and
hence the synchronization component is reduced significantly. Since the
processor spends less time in synchronization points, the (relative) busy and memory
components increase.

Fig. 4. Simulated HDSM speedups (with respect to level-3 processor) for single-thread
and virtual-processor assignments (software and hardware multithreading models).

[Chart: Execution time components, single-thread]

Fig. 5. Relative contributions of busy, memory and synchronization to total execution
time of a level-1 processor under the single-thread assignment.

[Chart: Execution time components, MT-SW]

Fig. 6. Relative contributions of busy, memory and synchronization to total execution
time of a level-1 processor under the virtual-processor, software multithreading
assignment.

A comparison of the multithreaded cases (Figures 6 and 7, respectively) shows
that, for all benchmarks (in particular, C4.5 and FFT), the relative memory
access component is reduced when hardware support is present. This is explained
by the ability of hardware multithreading to hide memory latencies by
overlapping memory accesses from distinct threads. The improved memory behavior
is reflected in increased processor usage (busy component) and, ultimately, in
better performance over the software scheme, as shown in Figure 4.

For Radix, the hardware scheme fails to deliver better performance for the
following reason. In Radix, the increased frequency of context switches causes
interference in the level-1 cache, increasing the worst-case L1 miss rate in
processor 0 (HDSM level 1) from 9.7% to 15.1%.


Fig. 7. Relative contributions of busy, memory and synchronization to total execution
time of a level-1 processor under the virtual-processor, hardware multithreading
assignment.


5 Conclusions

A heterogeneous, hierarchical organization of processor and memory resources
of a DSM allows efficient execution of codes with various degrees of parallelism.
This organization also delivers high performance for unmodified, homogeneous
shared-memory parallel programs that exhibit a single degree of parallelism.

Support for the execution of multiple threads in the high-performance
processors of a heterogeneous DSM is key to delivering high performance for such
homogeneous parallel applications. This paper shows that the virtual-processor
assignment of threads to nodes that are heterogeneous only with respect to ILP
hardware and cache sizes improves the average performance of HDSMs by up to
45% when compared to a single-thread assignment policy.

This paper also shows that hardware support for block multithreading
in the high-performance upper-level processors is desirable for an HDSM
organization. A simulation analysis shows that hardware multi-threading improves
the performance of virtually-assigned homogeneous applications in HDSMs by as
much as 21% (13% on average) over a software-based context-switching scheme.

A detailed analysis of the execution in the multithreaded upper-level
processors shows that, while the virtual-processor thread assignment mechanism is able
to improve load balancing, the hardware multithreading solution is particularly
effective in overlapping high-latency shared-memory accesses and reducing the
memory component of the execution time.


Abstract. The study of the solution of the Generalized Sylvester Equation
and other related equations is a good example of the role played by
matrix arithmetic in the field of Modern Control Theory. We describe the
work performed to develop systolic algorithms for solving this equation
in a fast and effective way. The presented results show that the design
methodology used allowed us to propose the use of Systolic Libraries,
that is, reusable systolic arrays that can be implemented taking advantage of
FPGA technology. In this paper we show how it is feasible to
solve the Generalized Sylvester Equation using basic modules of Linear
Algebra that can be implemented on versatile systolic arrays.


1  Introduction. 

The Generalized Sylvester Equation, AXB + CXD = E, with A, C ∈ R^{m×m},
B, D ∈ R^{n×n} and X, E ∈ R^{m×n}, and some simpler derived equations such as
the Sylvester [7], [15], [3], Lyapunov [13], [17] and Stein [7], [15] equations have
multiple and important applications in the field of Control Theory [9], [7], [15].

Obtaining the solution of these equations is a suitable problem for the
efficient use of parallel algorithms, due to the regular structure of the matrices.
However, when real-time constraints apply to the system, the use of dedicated
processors, usually implementing systolic algorithms in VLSI, is required. We
have recently presented several works [10], [12] showing that a modular approach
to systolic algorithms is a suitable way of building fast, reconfigurable solutions
to be implemented in FPGA devices to obtain cost-effective custom processors
to solve different problems.

The starting point is a new design methodology [10] based on the Kronecker
Product and Vec-Function operators. Algorithms obtained this way are easy to
parallelize because they consist of combinations of basic, widely studied operations
(solving a triangular equation system, Gaxpy, Saxpy, QR decomposition of
a Hessenberg matrix, ...), and the required data flow is well structured to pass
from one functional block to another without intermediate storage.

Extending these results, we have compiled in a Systolic Library for Linear
Algebra all the basic modules, following the same principle of modular programming
that generated other sequential and parallel environments [1], [18]. For the
modules of this library [11] to be useful to solve any problem in their application
field, two restrictions hold: (1) all the systolic arrays must share a compatible
data flow, to allow results from one of them to be forwarded to another, and (2) the
arrays must be designed to process problems of any size. These two restrictions
have been satisfied using dynamic arrays and applying the DBT transformation
[14] on the basic operations of linear algebra.

The application described in this paper is a good example of the use of
the Systolic Library. The first step to solve the Generalized Sylvester Equation
following the method proposed by Golub, Nash and Van Loan [4] is transforming
the original problem $A'X'B' + C'X'D' = E'$ into $AXB + CXD = E$, using
orthogonal similarity transformations on the pencils $A' - \lambda C'$ and $D' - \lambda B'$ to
obtain their Generalized Schur Forms (that is, $P_1^T (A - \lambda C) P_2 = A' - \lambda C'$ and
$Q_1^T (D - \lambda B) Q_2 = D' - \lambda B'$). The coefficient matrices of the resulting equation
are in a condensed form. We have worked on the solution for three cases [10]:
first, when all of them are triangular (Triangular Case); second, when A is Schur
or Hessenberg and the others triangular (Hessenberg Case); third, when both
matrices A and D are Schur (General Case). The study of the first two cases
has made possible the development of the basic arrays; the study of the general
case allowed us to show that the collection of routines obtained is efficient
(and sufficient) to solve more general and complex problems.

Section 2 presents the basis of the methodology for developing the algorithms:
the definition of Kronecker Product and Vector Function of a matrix. Section
3 describes the main operations to be solved when studying the solution of the
Generalized Sylvester Equation in the General Case. Then section 4 shows how
to use the library to implement this operation. Finally, section 5 concludes and
presents the ongoing work.


2  Applying  the  Methodology  of  Design. 

The methodology used to solve the Generalized Sylvester Equation, described
in [10], is based on the definition of the Kronecker Product and Vec-Function
of a matrix. The properties of both operators [6] can be applied to simplify
the problem. Concretely, by applying them to the equation
$AXB + CXD = E$, the linear equation system
$(B^T \otimes A + D^T \otimes C)\,\mathrm{vec}(X) = \mathrm{vec}(E)$,
shown in figure 1, is obtained (footnote 1). The resulting system, too large for
practical implementation, offers a clear representation of the data dependencies
and a simple expression of the basic steps required to solve the problem.
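
The identity behind this system, $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$, can be checked numerically on a small case. The following is only an illustrative sketch in plain Python: the helper names (`matmul`, `kron`, `vec`, `add`) and the test matrices are assumptions made for the example and do not come from the paper or from the Systolic Library.

```python
def matmul(P, Q):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(col) for col in zip(*P)]

def add(P, Q):
    return [[a + b for a, b in zip(r, s)] for r, s in zip(P, Q)]

def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    return [[P[i][j] * Q[k][l]
             for j in range(len(P[0])) for l in range(len(Q[0]))]
            for i in range(len(P)) for k in range(len(Q))]

def vec(P):
    """Stack the columns of P into one flat list (column-major order)."""
    return [P[i][j] for j in range(len(P[0])) for i in range(len(P))]

# Arbitrary integer test data with m = 2 and n = 3 (hypothetical values).
A = [[1, 2], [0, 3]]
C = [[2, 1], [0, 1]]
B = [[1, 0, 0], [4, 2, 0], [5, 6, 3]]
D = [[1, 0, 0], [2, 1, 0], [3, 1, 2]]
X = [[1, 2, 3], [4, 5, 6]]

# Right-hand side E = A X B + C X D.
E = add(matmul(matmul(A, X), B), matmul(matmul(C, X), D))

# Coefficient matrix of the equivalent (mn x mn) linear system.
big = add(kron(transpose(B), A), kron(transpose(D), C))
lhs = [sum(c * x for c, x in zip(row, vec(X))) for row in big]

print(lhs == vec(E))
```

With lower triangular B and D, as in this example, the block structure of `big` is upper triangular, which is what makes a column-by-column back substitution possible.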

The structure of this system, similar to an upper triangular system, suggests the
application of the Back Substitution Algorithm to solve the problem. For example, an
intuitive and simple method would be to obtain the value of $x_n$ and then update

1 The pencil $D - \lambda B$ has lower quasi-triangular structure; this affects only
the order of resolution and helps to visualize the problem.




VECPAR  ’2000  -  4th  International  Meeting  on  Vector  and  Parallel  Processing 


Fig.  1.  Linear  Equation  System  obtained  by  applying  the  Kronecker  Product  and  the 
Vec-function  to  the  Triangular  Generalized  Sylvester  Equation. 


the values of $e_{n-1}, \ldots, e_1$ as is done for the solution of a triangular system. The
resulting procedure is shown in figure 2.


Calculate Q: (A b_ii + C d_ii) Q is upper triangular;
Solve ((A b_ii + C d_ii) Q) (Q^T x_i) = e_i;
w := (A Q) * (Q^T x_i);
v := (C Q) * (Q^T x_i);
x_i := Q * (Q^T x_i);
for j := i-1 downto 1 do
    Update e_j := e_j - w b_ij - v d_ij
endfor;


Fig. 2. SGH Step: Procedure to obtain $x_i$, assuming that $d_{i-1,i} = 0$.
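
The column-by-column scheme of the SGH step can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the systolic algorithm: B and D are taken as lower triangular (the Triangular Case), each diagonal block is solved by exact Gauss-Jordan elimination instead of the orthogonal transformation Q, indices are 0-based, and all function names are hypothetical.

```python
from fractions import Fraction

def solve(M, rhs):
    """Solve M z = rhs by Gauss-Jordan elimination with exact fractions."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)  # pivot row
        aug[c], aug[p] = aug[p], aug[c]
        aug[c] = [v / aug[c][c] for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def matvec(M, z):
    return [sum(a * b for a, b in zip(row, z)) for row in M]

def sylvester_lhs(A, B, C, D, X):
    """Return A X B + C X D for matrices stored as lists of rows."""
    def mul(P, Q):
        return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
                 for j in range(len(Q[0]))] for i in range(len(P))]
    AXB, CXD = mul(mul(A, X), B), mul(mul(C, X), D)
    return [[a + c for a, c in zip(r, s)] for r, s in zip(AXB, CXD)]

def solve_triangular_case(A, B, C, D, E):
    """Solve A X B + C X D = E for lower triangular B and D."""
    m, n = len(A), len(B)
    cols = [[Fraction(E[r][j]) for r in range(m)] for j in range(n)]  # e_j
    X = [None] * n
    for i in range(n - 1, -1, -1):          # last column first
        M = [[A[r][s] * B[i][i] + C[r][s] * D[i][i] for s in range(m)]
             for r in range(m)]
        X[i] = solve(M, cols[i])            # (A b_ii + C d_ii) x_i = e_i
        w, v = matvec(A, X[i]), matvec(C, X[i])
        for j in range(i):                  # e_j := e_j - w b_ij - v d_ij
            cols[j] = [e - wr * B[i][j] - vr * D[i][j]
                       for e, wr, vr in zip(cols[j], w, v)]
    return [[X[j][r] for j in range(n)] for r in range(m)]

A = [[1, 2], [0, 3]]
C = [[2, 1], [0, 1]]
B = [[1, 0, 0], [4, 2, 0], [5, 6, 3]]
D = [[1, 0, 0], [2, 1, 0], [3, 1, 2]]
X_true = [[1, 2, 3], [4, 5, 6]]
E = sylvester_lhs(A, B, C, D, X_true)
X_found = solve_triangular_case(A, B, C, D, E)
print(X_found == X_true)
```

The vectors `w` and `v` are computed once per column and reused in every update, which is the role the auxiliary vectors w and v play in figure 2.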


But figure 1 also shows that for certain elements (for example $x_3$), that simple
procedure cannot be applied because there are subdiagonal elements of matrix
D ($d_{23}$) that produce subdiagonal blocks in the transformed matrix. It is then
necessary to solve at once two columns of matrix X ($x_3$ and $x_2$). We will call
this new operation Solve_2. Figure 3 shows the complete procedure to solve the
equation.

Fig. 3. The resulting SGG Algorithm.

In the resulting SGG Algorithm all the operations but Solve_2 are basic
operations of Linear Algebra and they can be directly performed on the arrays
designed in the systolic library described in [11]. In fact, the SGH step is the
basic stage of the Algorithm for solving the Hessenberg case [10]. Therefore, to
continue with the study of the solution of the General case it is necessary to
study this new operation.


3  The  SOLVE_2  Operation. 

For the efficient implementation of the Solve_2 operation we start by analyzing
the matrix M of the corresponding system, for a possible instance with m = 4 (eq. 1).

We have followed the proposal of Golub, Nash and Van Loan [4] to reduce
the cost of triangularizing the matrix ($O(m^3)$ flops, footnote 2). Applying to the problem
a permutation matrix that transforms $1, 2, \ldots, mn$ into
$1, n+1, 2n+1, \ldots, (m-1)n+1, 2, n+2, 2n+2, \ldots, (m-1)n+2, \ldots, n, 2n, 3n, \ldots, (m-1)n, mn$,
the result is an equivalent problem in which the coefficient matrix is an upper

2 According to the old definition of flops [5], a[i] = a[i] + b[i] * c[i], to better compare
the sequential algorithm with the systolic implementation.






triangular  matrix  with  two  non-zero  subdiagonals.  Using  that  transformation  in 
the  example,  the  result  is 

Different possibilities were considered when designing the corresponding algorithm
to avoid the construction of the auxiliary matrix $P^T M P$. Two were
studied in depth due to their feasibility:


1. To process M as matrix $(Ab_{ii} + Cd_{ii})$ in the SGH step. The basic idea in the
procedure described in figure 2 is to look for a compatible data flow among
the operations to allow a systolic implementation. Then the transformation
to triangularize the coefficient matrix of Solve is applied by columns. In
the systolic implementation the resulting data flow allows a good
chaining between the Calculate Q and Solve operations, and also between
Solve and Gaxpy; moreover, there is no need to form an auxiliary
matrix, since the work is done in terms of the original one. Our aim was also to keep the
original matrices in the Solve_2 operation, following for the triangularization
the reduction order imposed by the permutation of M in eq. 2. The result
was the design of a sequential algorithm, SGG1 [10], of $O(5m^2n + mn^2)$ flops.

2. To process M in a similar way to the Back Substitution Algorithm, obtaining
the values of columns $x_i$ and $x_{i-1}$ by groups of two elements (corresponding
to zero subdiagonal elements of matrix A) or four elements (corresponding
to non-zero subdiagonal entries of matrix A). That must be done due to
the structure of M in eq. 1. For non-zero entries of the original matrix A
(for example elements $a_{43}$, $b_{43}$ and $d_{43}$) a 4 x 4 system has to be solved,
obtaining four values of the ith and (i-1)th columns of X:

$$\begin{pmatrix} a_{33} & b_{33} & a_{34} & b_{34} \\ c_{33} & d_{33} & c_{34} & d_{34} \\ a_{43} & b_{43} & a_{44} & b_{44} \\ 0 & d_{43} & c_{44} & d_{44} \end{pmatrix} \begin{pmatrix} x_{3,i-1} \\ x_{3,i} \\ x_{4,i-1} \\ x_{4,i} \end{pmatrix}$$

For entries whose value is zero, a 2 x 2 system is solved, obtaining two values
of the ith and (i-1)th columns of X. The corresponding sequential algorithm [10]
has a temporal cost of $O(m^2n + mn^2)$ flops.


3.1  Obtaining  Systolic  Algorithms  for  the  Solve_2  Operation. 

The  previous  resolution  schemes  present  two  major  drawbacks  for  their  systolic 
implementation: 

1.  For  the  first  approach,  the  rotations  involve  columns  in  different  blocks  of 
the  original  matrices  (marked  in  bold  in  eq.  2);  therefore  it  is  necessary  to 




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


explicitly form all the linear combinations of all the blocks involved in the
Update of other columns of matrix E. It is impossible to form the auxiliary
vectors w1, w2, v1 and v2 to reduce the cost of Update.

2.  For  the  second  approach  data  dependencies  are  so  strong  that  we  could  not 
find  an  efficient  systolic  algorithm  for  it. 


Therefore, to design an efficient systolic algorithm for the Solve_2 operation,
we studied the reuse of those obtained for simpler cases. When solving the Triangular
and the Hessenberg case, two basic systolic arrays were designed [11]. The
first one, called Module QR, has the capability of performing the operation


Calculate Q : $(\alpha A + \beta B)Q$ is upper triangular


obtaining AQ, BQ and Q, and working with matrices of any size. The description
is presented in figure 4. If $\alpha = 1$, A = A, $\beta = 0$ and B = I, the outputs of this
operation are AQ and Q.


Fig.  4.  Module  QR. 


The second one, called Module Solve/GAXPY, has the capability of simultaneously
performing the operations


Solve $(\alpha A + \beta B)x = e$ and w := A * x, v := B * x

also working with matrices of any size. The description is presented in figure 5.
If $\alpha = 1$, A = AQ, $\beta = 0$ and B = Q, among the outputs of this operation we
have v = x, obtained from $x = Q(Q^T x)$.

It is then possible to solve the General case of the Generalized Sylvester
Equation using the SGH step when a subdiagonal entry of the matrix D is zero
and using the following procedure when a subdiagonal element is non-zero:


1. Construct the 2m x 2m matrix $P^T M P$.

2. Construct the corresponding version of the Identity matrix: starting from

$$\tilde{I} = \begin{pmatrix} I_{m \times m} & I_{m \times m} \\ I_{m \times m} & I_{m \times m} \end{pmatrix}$$

apply on it the same permutation (assuming again m = 4),

$$P^T \tilde{I} P = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{pmatrix} \qquad (3)$$

3. Using the Module QR ($\alpha = 1$, $A = P^T M P$, $\beta = 0$ and $B = P^T \tilde{I} P$), nullify
the two subdiagonals of matrix $P^T M P$, obtaining $(P^T M P)Q$ and $(P^T \tilde{I} P)Q$.

4. Using the Module Solve/GAXPY ($\alpha = 1$, $A = (P^T M P)Q$, $\beta = 0$ and
$B = (P^T \tilde{I} P)Q$), solve the triangular system and obtain $x_i$ and $x_{i-1}$ from
the solution of the system.

5. Using the Module Solve/GAXPY (A = A, B = C and any value for $\alpha$ and
$\beta$), calculate w1, w2, v1 and v2 and Update the matrix E.

This  procedure  can  be  entirely  implemented  with  the  proposed  systolic  arrays 
independently  of  the  size  of  the  coefficient  matrices  of  equation  AXB  +  CXD  = 
E. 
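
The effect of the permutation in step 1 can be illustrated on a small case. The sketch below is an assumption-laden example, not the library code: it builds the 2m x 2m matrix M from an upper Hessenberg A, an upper triangular C and hypothetical 2 x 2 diagonal sub-blocks of B and D (named `Bsub` and `Dsub` here), applies the interleaving permutation with 0-based indices, and measures how far below the diagonal the non-zero entries reach.

```python
def build_m(A, C, Bsub, Dsub):
    """2m x 2m matrix whose block (a, b) is A*Bsub[b][a] + C*Dsub[b][a]."""
    m = len(A)
    return [[A[r][s] * Bsub[b][a] + C[r][s] * Dsub[b][a]
             for b in range(2) for s in range(m)]
            for a in range(2) for r in range(m)]

def shuffle(M):
    """Apply the same interleaving permutation to rows and columns."""
    m = len(M) // 2
    perm = [a * m + r for r in range(m) for a in range(2)]  # 0, m, 1, m+1, ...
    return [[M[p][q] for q in perm] for p in perm]

def lower_bandwidth(S):
    """Largest (row - column) offset that holds a non-zero entry."""
    return max(p - q for p, row in enumerate(S)
               for q, v in enumerate(row) if v != 0)

# Hypothetical data with m = 4: A is upper Hessenberg, C upper triangular.
A = [[1, 2, 3, 4], [5, 6, 7, 8], [0, 9, 1, 2], [0, 0, 3, 4]]
C = [[1, 1, 1, 1], [0, 2, 1, 1], [0, 0, 3, 1], [0, 0, 0, 4]]
# Rows/columns i-1 and i of B (lower triangular) and D (d_{i-1,i} non-zero).
Bsub = [[2, 0], [1, 3]]
Dsub = [[1, 5], [2, 1]]

M = build_m(A, C, Bsub, Dsub)
S = shuffle(M)
print(lower_bandwidth(M), lower_bandwidth(S))
```

For this data the unpermuted matrix has non-zero entries m positions below the diagonal, while the permuted one has only two non-zero subdiagonals, which is the structure that the Module QR then nullifies.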


4  Systolic  Implementations  for  the  General  Case. 

The basic stage of the systolic computation will be the obtaining of a column of
matrix X, $x_i$, when $d_{i-1,i} = 0$ (SGH step), or the obtaining of two columns of
matrix X, $x_i$ and $x_{i-1}$, when $d_{i-1,i} \neq 0$ (SGG step).

Figure 6 shows how to combine the two basic modules to solve the SGH
step. In addition to the Module QR and the Module Solve/GAXPY, a special
cell, called GAXPY_2, is needed to complete the calculation of w and v, accumulating
on them the corresponding products with the subdiagonal elements of AQ and
CQ (which have the same zero-structure as matrix A); an array formed by SAXPY
cells, capable of performing a Saxpy operation, is also necessary to
update each column of matrix E. This update is made with the value of
vector $Q^T x_i$. The figure does not show the calculation of $x_i$ from this value, but
it can be performed on the same array, introducing only the Identity matrix,
the corresponding rotations and the vector $Q^T x_i$.


Fig. 6. Obtaining $x_i$ ($d_{i-1,i} = 0$).


The main stage of the SGG step is formed by the successive transformation,
shown in figure 7, of the matrix $P^T M P$. To nullify the second subdiagonal,
the matrix Aux is considered, formed only by the first (2m - 1) columns of the
original matrix; that is, it is a Hessenberg matrix of size 2m x (2m - 1). When
the subdiagonal has been nullified, the matrix Aux1, also of size 2m x (2m - 1),
is obtained. Then it is necessary to add the last column of the original matrix;
again, a Hessenberg matrix, of size 2m x 2m, is obtained and,
after the process, Aux3 is obtained, which is upper triangular.

Figure 8 shows the complete process and the order in which each one of
these auxiliary matrices is processed. Note that the modules are of size m, so
the process supposes the application of the DBT [14] on these matrices. The
block division for Calculate Q will be discussed in detail in subsection 4.1,
but note that the matrices are cut in blocks of the size of the
arrays in a special way, making two blocks share a column.


4.1 Size-Independent Systolic Implementation.


(a) Initial matrix (b) Matrix Aux (c) Matrix Aux1 (d) Matrix Aux2 (e) Matrix Aux3


Fig. 7. Successive transformation of the matrix $P^T M P$.


Let us suppose that the block system of figure 1 is made up of N x N upper
Schur blocks $(Ab_{ij} + Cd_{ij})$, of size M x M, and that each block is built of q x q blocks
of dimension m x m, with N = pn and M = qm. Let us also suppose that
each of the columns of X and E is built of q blocks of size m. According
to this block structure, we will identify the subblock at row r and column s
of the $(Ab_{ij} + Cd_{ij})$ block with the notation $(A^{rs}b_{ij} + C^{rs}d_{ij})$; the rth
subvector of the ith column of X, $x_i$, or of E, $e_i$, will be written $x_i^r$ or $e_i^r$. This
block division will be used to develop a block oriented process to solve the
Generalized Sylvester Equation; the described situation allows the decomposition
of operations Solve, Gaxpy and Update to process blocks of size m x m. To
decompose the operation Calculate Q (and Apply Q) it is necessary to realize
that there can exist subdiagonal elements in the matrix $(Ab_{ii} + Cd_{ii})$ that do
not belong to any block. In order to nullify them, the block division for this
operation is similar to the one depicted in figure 9: two consecutive blocks in a
row, $(A^{rs}b_{ii} + C^{rs}d_{ii})$ and $(A^{r,s+1}b_{ii} + C^{r,s+1}d_{ii})$, share a column, in such a way
that we can calculate and apply the corresponding rotations.

Also, to perform the Update operation, the following block division for the
matrix E and the ith row of matrices B and D must be considered:

$$E = \begin{pmatrix} E_{11} & E_{12} & \cdots & E_{1p} \\ E_{21} & E_{22} & \cdots & E_{2p} \\ \vdots & \vdots & & \vdots \\ E_{q1} & E_{q2} & \cdots & E_{qp} \end{pmatrix}, \qquad
\begin{aligned} b_{i,1:i} &= (b_{i1}\ b_{i2}\ \cdots\ b_{iK}\ b_{i,K+1}) \\ d_{i,1:i} &= (d_{i1}\ d_{i2}\ \cdots\ d_{iK}\ d_{i,K+1}) \end{aligned}$$

Let us assume K = ((i-1) DIV n) and L = ((i-1) MOD n). Each block $E_{ij}$ is of
size (m + 1) x n, and shares a row with the corresponding block $E_{i+1,j}$. Each
subblock of the ith row of B and D has n elements, except for the subblocks
$b_{i,K+1}$ and $d_{i,K+1}$, which have L+1.
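
As a small worked illustration of this partition (a sketch only; the function name `row_partition` is hypothetical), the first i entries of row i of B or D split into K full subblocks of n elements plus a last subblock of L+1 elements, so that K·n + L + 1 = i:

```python
def row_partition(i, n):
    """Split the first i entries (1-based) of a row into subblocks of size n."""
    K, L = (i - 1) // n, (i - 1) % n   # K = (i-1) DIV n, L = (i-1) MOD n
    sizes = [n] * K + [L + 1]          # K full subblocks, then one of L+1
    return K, L, sizes

K, L, sizes = row_partition(i=8, n=3)
print(K, L, sizes)  # 2 1 [3, 3, 2]
```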

To solve the problem in the size-independent case, the Dense-to-Banded
Transformation, DBT [14], has to be applied to the non-triangular submatrices
involved in the process. The DBT obtains, from a matrix of size m x m,
another one of size m x 2m or 2m x m, but with bandwidth m, by the adequate
juxtaposition of the upper and lower triangles of the matrix. In the present problem
it is necessary to find a DBT common to all the operations, so the second
possibility must be chosen.

Fig. 9. (a) Block division for the Solve and Gaxpy operations, (b) Block division for
Calculate Q and Apply Q operations.


As in the size-dependent algorithm, the basic stage will differ depending
on whether $d_{i-1,i}$ is zero or not. When $d_{i-1,i} = 0$, the basic stage is
the calculation of $x_i^s$, shown in figure 10. This process is divided into two steps.
First, $(Q^s)^T x_i^s$ is obtained. Then, two different operations on different data
are required: the Apply Q and Gaxpy operations to preprocess the w and v
vectors for later stages, and the update of E with regard to the calculated value.
It is supposed that when obtaining $(Q^s)^T x_i^s$ the control signal is kept high in the
QR and Solve/GAXPY modules; afterwards it goes low to start the preprocess,
which is developed simultaneously with the updating on the array of n SAXPY
cells. The Update operation involves the first L subcolumns
of the $E_{s,K+1}$ block and the first K blocks (from $E_{sK}$ to $E_{s1}$). During this
operation, the O(n) array has to receive as inputs the required K copies of $w^s$
and $v^s$ to complete the calculation. To do that, we can use the GAXPY_2 cell:
depending on the value of a control signal (independent from the signal managing
the CALCULATE-Q and SOLVE cells), it selects inputs to the GAXPY array from
the SOLVE cell or from memory.

When $d_{i-1,i} \neq 0$ and the Solve_2 operation must be block oriented, the matrix
M must also be divided into blocks; the notation to be used will be:

$$M^{rs} = \begin{pmatrix} A^{rs}b_{i-1,i-1} + C^{rs}d_{i-1,i-1} & A^{rs}b_{i,i-1} + C^{rs}d_{i,i-1} \\ C^{rs}d_{i-1,i} & A^{rs}b_{ii} + C^{rs}d_{ii} \end{pmatrix} \qquad (4)$$


In this case the basic stage will obtain $x_{i-1}^s$ and $x_i^s$. Blocks are introduced
in the order suggested by figure 10, but taking into account that each diagonal
block is processed as shown in figure 8 and each dense block as shown in figure 11.
Once $x_i$ and $x_{i-1}$ have been obtained, blocks of matrices A and C are introduced
into the array in the order shown by figure 10 to complete the update of blocks
$E_{sK}, \ldots, E_{s1}$ while obtaining $w1^s$, $w2^s$, $v1^s$ and $v2^s$.

This theoretical scheme could be optimized in the systolic implementation by
overlapping stages, taking advantage of the 2-slow data flow as well as of the existence
of operations without data dependencies (for instance, in Solve_2 during the
Update of matrix E or, if two consecutive Solve_2 have to be applied, the Update
part of the first can be delayed until the beginning of the second, increasing the
efficiency of the SAXPY cells).

5 Conclusions and Future Work

We have shown how the Generalized Sylvester Equation and its derived equations
can be systematically solved using systolic blocks that perform basic operations
of linear algebra and that form a complete Systolic Library. This method of
solving these equations has been obtained by means of a new design methodology.
Its main advantage is the modularity of the obtained solution, which allows
the same design principles used in software development to be applied. The methodology
has been applied to other equations derived from that one, in the cases shown and in
the case of A being a Hessenberg matrix [10], and all of them can be solved with
the basic arrays described in this paper. These results have been used to design
a complete Systolic Library [11] with the capability of solving a wide variety of
problems in the field of matrix algebra.

The work is being further extended in three different directions: the identification
of other fields in which to apply the same design methodology, the implementation
of the Systolic Library in FPGA devices and the automation of the process to
directly obtain the FPGA configuration from the high level specification of the

Abstract. The 2D packing problem is an NP-hard problem with applications
in various industries, from apparel to ship building. Current
computer based approaches still rely on interactive software and
pick-and-drag procedures performed by experienced people. Semi-automatic
commercial systems already exist, but to obtain a good final solution
the solutions obtained automatically still need to be refined.
Searching autonomously for good solutions in reasonable execution times
requires optimization approaches that rely on the generation and evaluation
of a large number of solutions. To accelerate this process, a reconfigurable
and parallel computing subsystem was built that works as
an auxiliary processor for low-cost desktop PC computers. This paper
briefly presents the architecture of the auxiliary processor and the experimental
results obtained by different approaches to parallelize the target
problem on this parallel architecture.


1  Introduction 

The 2D packing problem is an NP-hard problem, consisting in finding a distribution
of a given set of irregular shapes over a limited space. Good solutions,
although normally sub-optimal, are the ones that lead to minimum waste of the
area available for placing the shapes. The particular instance of this problem addressed
in this work applies to the textile industry, where the placement area is a
width limited rectangular sheet of fabric, and the global objective is to minimize
the length of the region used by a particular solution.

Fully automatic approaches targeted to industrial environments must achieve
at least the same results as the traditional solutions built by hand, using interactive
software applications based on pick-and-drag procedures. Because of the
NP-hard nature of the problem, it is impossible to guarantee the optimality of
one solution. However, good solutions may be found by using metaheuristic
search procedures, like local search, tabu search or simulated annealing. These
techniques rely on the construction and evaluation of a large number of complete

* This work was partially funded by the Portuguese government under the PRAXIS
XXI Program (Project nr. P017-P3. lb-09/97 - AUTOMARC).
solutions, in order to guide the search algorithm. General approaches for these
techniques start from an initial but feasible solution, and search for better solutions
in its vicinity, according to the different criteria used by these procedures.
Usually, neighbor solutions are generated by making elementary modifications to
the parameters that characterize that solution.


In  the  2D  packing  (or  nesting)  problem,  one  solution  can  be  represented  by 
the  set  of  coordinates  occupied  by  the  polygons  that  form  the  problem  data. 
One  neighbor  solution  may  be  generated  by  simply  moving  one  piece  a  small 
distance  in  a  certain  direction,  and  re-arranging  the  others  to  keep  the  solution 
feasible  (i.e.  avoid  overlaps  among  the  polygons).  Although  this  is  relatively 
easy  to  do  by  hand  with  hard  paper  molds  or  interactive  computer  applications, 
the  amount  of  computation  required  to  perform  this  operation  automatically  is 
too  high.  In  this  problem,  the  critical  time  consuming  tasks  are  the  low  level 
geometric  operations  that  analyze  the  relative  positions  between  polygons  and 
detect  possible  overlaps  that  may  turn  a  solution  unfeasible. 


To accelerate existing optimization approaches for this problem based on
meta-heuristics [1], a custom auxiliary processor for PC computers has been
built, based on an array of dedicated processing nodes (PPK, Polygon Positioning
Kernel) and a programmable processor (FCP, Fafner Control Processor).
The PPK nodes are custom digital circuits that efficiently detect
intersections among polygons, thus providing support to handle
the polygon datatype efficiently. The FCP processor executes a stored program that
implements a nesting heuristic to build a complete solution, making use of that
parallel infrastructure to verify the feasibility of solutions. This custom computing
machine is called Fafner (Flexible Architecture For NEsting pRoblems) [2, 3],
and interfaces with the higher-level optimization software running in the PC.


This auxiliary processor is built on a reconfigurable digital system based on
FPGA circuits (Field Programmable Gate Array). The flexibility afforded by
such an implementation platform allowed several design iterations in the hardware
domain, to experiment with and evaluate different strategies that enabled the
efficient exploitation of the computing power available in this system.


This paper presents the results obtained with two heuristic approaches to
build solutions for the 2D packing problem, and the different strategies used
to parallelize them in the array of PPK nodes. The remainder of this paper is
organized as follows. Section 2 details the core geometric operations involved in
this class of problem, and the common procedures that are normally used for
the type of application addressed in this work. Section 3 presents the hardware
organization of Fafner, and describes the overall operation of the system. In
section 4, different approaches to the parallelization of this problem on the target
custom computer and the corresponding results are presented and discussed.
Finally, in section 5 the final conclusions are drawn, as well as suggestions for

2 Geometric operations

Although there are various techniques to build solutions for the nesting problem
[1, 4, 5, 6, 7], the approach used in this work places polygons on a discrete
grid with acceptable size (for example, 1 x 1 mm is fine enough for the textile
industry), and moves one polygon (the working polygon) one grid unit at a time,
checking the solution for feasibility for each new position occupied by it. The
way polygons are moved and placed into a final and non-overlapping position is
defined by a nesting heuristic.


[Figure 1 panels: initial polygon list; swap two polygons; modified polygon list]


Fig.  1.  Different  solutions  by  changing  the  order  of  the  polygon  list. 


In this work, one solution is completely specified by an ordered list of polygons
(the polygon list) and the nesting heuristic that is followed to arrange them
(figure 1). Polygons are picked in order from that list and moved in the placement
area, following rules determined by the heuristic. Using the same nesting
heuristic, different solutions may be created by making local modifications to the
order of the polygon list. A simple procedure to create neighbor solutions consists
in randomly selecting two different polygons in the polygon list and swapping
their positions. More sophisticated neighbor generation procedures can exploit
relationships among different polygons, such as area or shape, to favor certain
types of solutions.
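
The simple swap-based neighbor generation described above can be sketched as follows. This is only an illustration: the function name `swap_neighbor`, the use of integer polygon identifiers and the seeded generator are assumptions made for the example, not details of the optimization software used with Fafner.

```python
import random

def swap_neighbor(polygon_list, rng):
    """Return a copy of the list with two distinct positions swapped."""
    i, j = rng.sample(range(len(polygon_list)), 2)  # two different indices
    neighbor = list(polygon_list)
    neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
    return neighbor

rng = random.Random(0)        # seeded only to make the example repeatable
current = list(range(8))      # polygon identifiers in placement order
neighbor = swap_neighbor(current, rng)
print(current, neighbor)
```

The neighbor contains exactly the same polygons in a different order, so the nesting heuristic can be re-run on it unchanged to obtain the cost of the new solution.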

As far as the FAFNER system configured with a given heuristic is concerned, one
solution is represented only by the polygon list. The FAFNER processor receives
this list from the optimization software running in the PC, computes one complete
solution and returns the cost of that solution. In the present implementation,
the polygon list is represented by a 128 byte vector and the result is one 16
bit integer that measures the length of the rectangular placement area used by
that solution. Because so little data is transferred between the host computer
and the auxiliary processor for each solution, data transfer adds a negligible
processing time overhead.

The main task of the FCP processor is to move polygons on the placement
area, thus implementing the nesting heuristic. To check for feasibility, FCP calls
in parallel the array of PPK nodes. Each PPK node stores locally a list of polygons
already placed into their final positions, and verifies the overlap condition
between the working polygon and its own list of polygons.

The core operation performed by each PPK node is to verify if the working
polygon overlaps each one of its stored polygons. This verification is
performed sequentially for all polygons, or until one overlap is found. This
verification is done in stages. First, the relative positions of their bounding
boxes are compared. If they overlap, a more detailed edge-by-edge comparison
must be performed. To further speed up this process, edges are grouped into
second-level bounding boxes (SLBBs) that are checked first, before comparing
pairs of edges of the two polygons. To verify if two edges intersect, their
bounding boxes are checked first, and a more complex and time consuming
procedure based on D-functions [8] is started only if it is necessary.

This hierarchical procedure saves large amounts of computation. For an
industrial problem optimized with a tabu search procedure, nearly 6,000 million
pairs of polygons are analyzed, but only 7.3% require the edge-by-edge
analysis. The number of polygon edges checked for intersection is nearly 81,000
million, but only 0.6% of this number require the more complex D-function
analysis. This problem has 48 polygons, 10 different shapes and a total of 960
edges, and has been adopted as the main benchmark used to evaluate the various
implementations created in this work.

As a result of this hierarchical procedure, the time spent in each processing
step is very different: a bounding box comparison takes only one clock cycle
(of polygons, SLBBs or edges), but concluding whether two edges intersect or
not may take up to 11 clock cycles. Because of this, the time required to
evaluate the overlap condition of two polygons varies from 1 clock cycle to a
worst case that must perform the exhaustive edge-by-edge analysis and exceeds
N × M × 11 clock cycles, where N and M represent the number of edges of each
polygon. The actual processing time required for this operation is thus
dependent on various factors: the relative position of the polygons, their
shape and the number of edges that are associated into SLBBs.
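
The staged test can be illustrated with a short Python sketch. It is a
simplified software model, not the PPK hardware: the SLBB level is omitted, a
standard orientation-based segment test stands in for the D-function procedure
of [8], and the cycle counter (1 per bounding box comparison, 11 per exact edge
test) is only an illustrative accounting based on the figures quoted above. All
function names are hypothetical.

```python
def bbox(points):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a point list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(p1, p2, q1, q2):
    """Proper crossing test (stand-in for the exact procedure of [8])."""
    d1, d2 = orient(q1, q2, p1), orient(q1, q2, p2)
    d3, d4 = orient(p1, p2, q1), orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def edges(poly):
    return [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]


def polygons_overlap(a, b):
    """Return (overlap_found, modelled_cycles) for two vertex lists.

    Only crossing boundaries are detected; a polygon lying entirely inside
    another one is not handled by this sketch.
    """
    cycles = 1                          # polygon-level bounding boxes
    if not boxes_overlap(bbox(a), bbox(b)):
        return False, cycles
    for e in edges(a):
        for f in edges(b):
            cycles += 1                 # edge-level bounding boxes
            if boxes_overlap(bbox(e), bbox(f)):
                cycles += 11            # assumed cost of the exact test
                if segments_cross(*e, *f):
                    return True, cycles
    return False, cycles
```

Two distant polygons are rejected after a single modelled cycle, whereas
polygons whose boundaries cross pay for the edge-level tests until the first
intersection is found.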

3  Fafner  architecture 

The Fafner system is composed of two processing units: the FCP and the array
of PPK nodes [2,3]. The FCP (Fafner C[...] Processor) is a custom stored
program processor that executes a program implementing the nesting heuristic.
This architecture uses an instruction set and a [...] organization designed for
this application, enabling the [...] a convenient datatype. Custom low-level
instructions implement the communication with the array of PPK nodes through a
reduced set of commands accepted by them.

Fig. 2. The Fafner system architecture.

The PPK nodes constitute a parallel machine dedicated to the evaluation of
intersections between polygons. This array communicates with FCP through a
common bus, plus an additional circuit that works as a concentrator of the
responses of each processing node. Each node interprets commands issued by FCP
that define processing options, load polygon data into PPK nodes, assign a
position for the working polygon and execute the check for overlap procedure
against the set of placed polygons stored in the local memory of each PPK node.
The addressing mechanism allows commands and data to be sent to a single PPK
node or broadcast to all PPK nodes present in the array. The present
implementation includes 4 processing nodes, although the current version may be
extended up to 12 nodes without requiring any modification in the physical
hardware system.

The concentrator circuit manages the responses of all the PPK nodes connected
to it. When the PPK array is called to check for overlap, each PPK node works
in parallel with a different set of data. A direct consequence of this is that,
in the general case, the PPK nodes terminate their processing at different
times, depending on the geometric relationships between the working polygon and
the set of polygons it is checked against. Moreover, the results of that
processing (either overlap detected or overlap not detected) may be different
for each node. The concentrator implements a custom circuit that manages all
the status signals output by the PPK nodes, and feeds appropriate responses to
the FCP processor. The complexity of this function depends on the nesting
heuristic and the way it is implemented. It varies from three 4-input logic
gates to a much more complex control circuit that compares the positions of the
working polygon in all the PPK nodes to determine which one has found the best
feasible position.

The Fafner system has been implemented in a reconfigurable system built with
XILINX FPGA devices [9] and additional memory chips. A library of interface
routines has also been created to support the development of application
programs that use this infrastructure. The fast reconfiguration of this family
of programmable chips enabled easy and fast design iterations of the hardware
system. During this development, this has been crucial to tune up and improve
the efficiency of the hardware architecture of both the FCP and PPK, without
requiring any modification in the physical hardware platform. Besides, various
implementations of the whole system with specific optimizations for different
heuristics have been developed and can be programmed in a matter of seconds.
Figure 3 shows a picture of the Fafner system with 4 PPK nodes.


4  Parallel  approaches  to  nesting  problems 

Because the critical time consuming task is the check for overlap operation,
one important issue is how to dispatch and schedule these operations among the
PPK nodes, in order to exploit efficiently the computing power available in the
system. The technique used to build one solution and the strategy adopted to
distribute the polygons among the PPK nodes are important factors that largely
influence the effective gain in speed obtained by using several PPK processors
working in parallel. The various approaches that are being tried in the scope
of this work, and the results obtained, are presented in the next subsections.
All the techniques implemented represent a solution by an ordered list of
polygons, as described above in section 2.

The benchmark adopted in this work is an industrial problem taken from a [...]
and is formed by 48 polygons, 10 different shapes and a total of 960 edges.
These results were obtained with an implementation of the Fafner system running
at 10 MHz, for a sequence of 20 different solutions generated randomly,
creating neighbor solutions by swapping two polygons in the polygon list.

4.1 The right-to-left algorithm

A first nesting heuristic implemented in FCP and described in [1] consists of
pushing polygons from right to left, seeking a leftmost feasible position for
each polygon. A new polygon is first positioned into a leftmost position that
does not overlap any of the polygons already placed. Then, it is moved to the
left one unit at a time, while checking for overlap with the other placed
polygons. If an overlap is found, the previous position is restored and
additional up and down moves are tried until a final position is defined for
that polygon. Figure 4 illustrates the path followed by one polygon until its
final position is reached.


Fig.  4.  The  right-to-left  nesting  heuristic. 


To parallelize the evaluation of intersections by the PPK nodes, the set of
polygons already placed is distributed evenly among all the nodes. When the
algorithm run by FCP needs to verify the intersection of a new polygon against
all the placed polygons, the check for overlap command is broadcast to all the
PPK nodes. This operation terminates as soon as one node detects an
intersection, or when all the nodes conclude the processing without finding any
overlap with their own set of polygons. When a final position is established
for the working polygon, it is stored in the local memory of a PPK node
selected by the nesting heuristic. In this approach, this is done in a cyclic
fashion to distribute the polygons evenly among all the PPK nodes.
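
The cyclic distribution and the broadcast check can be modelled in software as
follows. This is a sequential sketch under stated assumptions: the names
`distribute_cyclic` and `broadcast_check` are hypothetical, and the `overlaps`
argument stands for whatever overlap test a node applies to one stored polygon.

```python
def distribute_cyclic(placed, num_nodes):
    """Assign the k-th placed polygon to node k mod num_nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for k, polygon in enumerate(placed):
        nodes[k % num_nodes].append(polygon)
    return nodes


def broadcast_check(nodes, overlaps):
    """Model of one broadcast 'check for overlap' command.

    Each node scans only its own list; the global answer is True as soon as
    any node reports an overlap (any() stops at the first hit).
    """
    return any(any(overlaps(p) for p in node_list) for node_list in nodes)
```

In the hardware the inner scans run concurrently, one per PPK node; the sketch
only reproduces the logical result of the operation.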

With this strategy, the complex operation that checks one polygon for overlap
against a set of polygons is well distributed among all the PPK nodes, each one
working in parallel with disjoint sets of polygons. As shown in the example of
figure 5, when 16 check for overlap operations have to be performed, this
strategy divides the number of check for overlap operations by the number of
PPK nodes. If these operations require approximately equal processing times,
this procedure will also divide the global processing time by the number of
processing nodes available in the system.

[Figure 5 labels: the list of placed polygons; working polygon; overlap detected in PPK 2]

Fig. 5. The parallelization strategy in the right-to-left nesting heuristic.

However, practical results have shown little improvement in the overall
performance when using four PPK nodes instead of one. Further analysis of these
results has shown that this is due to the disparity of processing times
required for the evaluation of intersections, as noted above. While the 4 PPK
nodes actually start their computation in parallel, in the majority of cases
only one node requires the complex edge-by-edge analysis, and all the other
PPKs terminate their processing based only on the analysis of bounding boxes,
within a few clock cycles. In this situation, the overall processing time is
clearly dominated by the work of a single node that performs the complex
edge-by-edge analysis, or even by the code run in FCP when all nodes conclude
their processing by analyzing only bounding boxes, either of polygons, SLBBs or
edges.
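
A small numerical model makes this effect concrete. The cycle counts below are
hypothetical values chosen only for illustration (15 checks that end after one
bounding box comparison and one check that needs 300 cycles of edge analysis);
they are not measurements from the Fafner system, and early termination of the
broadcast is ignored.

```python
def sequential_time(costs):
    """Cycles needed when a single node performs every check."""
    return sum(costs)


def parallel_time(costs, num_nodes):
    """Cycles needed when checks are dealt cyclically to num_nodes nodes.

    The broadcast only completes when the slowest node has finished.
    """
    per_node = [0] * num_nodes
    for k, cost in enumerate(costs):
        per_node[k % num_nodes] += cost
    return max(per_node)


# Hypothetical workload: one expensive check among 16.
costs = [1] * 15 + [300]
t1 = sequential_time(costs)        # 315 cycles
t4 = parallel_time(costs, 4)       # 303 cycles
improvement = 1 - t4 / t1          # about 0.038
```

With such a skewed workload, four nodes save only a few percent of the time,
which is the same order of magnitude as the gains reported for this heuristic.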

Table 1 presents the average execution times required by the combined
hardware/software system to build one complete solution. The slight 6.8%
reduction in the execution time achieved by using 4 PPK nodes instead of a
single node does not justify the investment in the additional processing nodes.
This improvement is even reduced to almost zero for simpler benchmarks, where
the software run in the PC and in FCP far dominates the global processing time.

Table 1. Execution times for the right-to-left algorithm.

Number of PPK nodes       4       3       2       1
Execution time (sec)      0.463   0.472   0.478   0.497
Improvement (%)           6.8     5.1     3.8     -

This procedure creates solutions of the nesting problem with short execution
times because most cases of the critical check for overlap operations are
determined by analysis of bounding boxes. However, because backtracking and
unfeasible solutions are not allowed during the positioning of polygons, there
are severe limitations that constrain the quality of the solutions generated.
For example, smaller polygons cannot travel over larger polygons to be placed
in blank areas that could be left between the larger polygons. Because of these
limitations and the poor utilization of the parallel array of PPK nodes, other
nesting heuristics were implemented that achieved much better results.


4.2  The  raster  algorithm 

Another approach is based on the algorithm described in [5]. This nesting
heuristic also places one polygon at a time, but searches the placement area
exhaustively to find the leftmost position where the moving polygon may be
placed. This technique solves the problems mentioned above and yields much
better results in terms of quality, while exploiting the parallelism of the PPK
array better.

A new polygon is placed first in the upper left corner of the placement area,
and a top to bottom, left to right raster is performed one unit at a time,
until a feasible position is found. To avoid unnecessary steps and speed up the
overall processing, the starting position of a new polygon may be set to the
final position found for the last polygon with the same shape. This technique
follows the same procedure to distribute the evaluation of intersections among
the processing nodes. The FCP processor manages the movement of the working
polygon on the placement area, and calls the array of PPK nodes to determine if
there is any overlap. However, because the path followed by a polygon more
frequently requires the more complex edge analysis against polygons stored in
different PPK nodes, the average improvement in the execution time achieved
with 4 PPK nodes increases to 12%, using a cyclic distribution of the polygons
among the PPK nodes. The operation of this nesting heuristic is illustrated in
figure 6, and the results obtained are presented in table 2.
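
The raster search can be sketched as two nested loops. The sketch assumes a
discrete placement area of `width` columns and `height` rows and a caller
supplied predicate `is_feasible(x, y)` that plays the role of the check for
overlap operation; these names are illustrative and do not come from the
Fafner software.

```python
def raster_place(width, height, is_feasible, start=(0, 0)):
    """Return the first feasible (x, y) in raster order, or None.

    Columns are visited left to right and each column top to bottom, so the
    first hit is a leftmost feasible position. `start` lets the search resume
    from the final position of the last polygon with the same shape.
    """
    x0, y0 = start
    for x in range(x0, width):
        first_row = y0 if x == x0 else 0
        for y in range(first_row, height):
            if is_feasible(x, y):
                return x, y
    return None
```

Every visited position costs one check for overlap, which explains why this
heuristic is far slower than the right-to-left one for the same benchmark.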


Table 2. Execution times for the raster algorithm.

Number of PPK nodes       4       3       2       1
Execution time (sec)      53.3    56.0    58.0    60.6
Improvement (%)           12.0    7.6     4.2     -

Fig. 6. The raster nesting heuristic.

Although the processing times required by this heuristic to build solutions
are more than 115 times worse than those of the previous approach, the quality
achieved by these solutions is much better. In most situations, the first
solution found by this technique is already better than the best solution
encountered with the previous approach after a large number of iterations,
typically never below 1000. With this heuristic embedded in a simulated
annealing search procedure, a new optimum was found for a synthetic benchmark
first proposed in [1] and commonly used to evaluate 2D packing algorithms.

4.3  Refinements  to  the  raster  algorithm 

In order to speed up the previous approach, the Fafner architecture was
redesigned to move into the PPK nodes functionalities previously accomplished
by software run in FCP. The increment of the coordinates that define the
position of the working polygon to perform the raster movement was implemented
in the PPK nodes as dedicated logic circuits based on binary counters and
comparators. This added very little complexity to the PPK nodes, but enabled
a significant reduction in the number of instructions executed by FCP in the
cycle that searches for the final position of one polygon. With this move into
the hardware domain, the main cycle executed by FCP just needs to issue a
sequence of check for overlap instructions to the PPK array and analyze the
results. Each PPK node automatically increments the current position of the
working polygon in a single system clock cycle.

As the heuristic procedure is the same, the improvements in the execution
times obtained with this refinement are due only to the reduction of the number
of instructions in the main loop executed by FCP.

Table 3 presents the results obtained with this implementation. The overall
execution times were reduced to approximately 50% of those of the previous
implementation, and the execution time with 4 PPK nodes is more than 28% lower
than with a single PPK node.

Table 3. Execution times for the improved raster algorithm.

Number of PPK nodes       4       3       2       1
Execution time (sec)      24.0    25.1    27.1    33.7
Improvement (%)           28.8    25.6    19.6    -

4.4 A new approach to the parallelization of the raster algorithm

In spite of the improvement achieved with the previous implementation, the
utilization of the PPK processing nodes is far from ideal. In that
implementation, the lack of efficiency with the number of processing nodes is
also related to the disparity of execution times each PPK takes to detect the
overlap condition. Contrary to the method used in the first heuristic (see
section 4.1), the raster heuristic moves polygons over unfeasible positions,
thus requiring in most cases the lower level and time consuming edge-by-edge
analysis. Because all PPK nodes are started at the same time to compute the
overlap condition with the working polygon in the same position, the final
result can only be determined when all nodes conclude their processing. If one
or more nodes terminate quickly because they did not find any overlap, they
must be kept in an idle state to wait for the completion of the slowest node.
After that, a new check for overlap may be initiated in the next position.

To further improve the utilization of the PPK array, the complete loop that
issues the check for overlap operations to the PPK array was also moved into
hardware and implemented in the PPK nodes. This way, each PPK node can search
autonomously for a final position for the working polygon, terminating only
when that position is found or when an external interrupt signal aborts its
operation.

To make use of this functionality, the strategy of distributing the list of
placed polygons disjointly among the PPK nodes cannot be used. In this
implementation, each PPK node holds the complete list of placed polygons, and
each node checks the feasibility for disjoint sets of discrete coordinates. The
scheme currently implemented for a Fafner system with N PPK nodes initially
places the working polygon in positions (X, Y) for node 1, (X, Y + 1) for node
2 and (X, Y + N - 1) for node N, and each PPK automatically increments its Y
coordinate N units at a time. Within this increment, the Y coordinate is
compared with the maximum width defined for the placement area to adjust the X
coordinate accordingly.
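
The interleaving of coordinates among the nodes can be sketched as a generator.
Nodes are numbered from 0 here, `height` denotes the extent of the Y axis and
`width` the extent of the X axis; the exact wrap-around rule (carrying the
excess of Y into the next column) is an assumption, since the text only states
that X is adjusted when Y reaches its limit.

```python
def node_positions(node, num_nodes, height, width, start=(0, 0)):
    """Yield the (x, y) positions examined by one PPK node (0-based index).

    Node k starts at (X, Y + k) and advances its Y coordinate num_nodes
    units at a time; when Y passes the limit, X is incremented.
    """
    if height <= 0 or num_nodes <= 0:
        raise ValueError("height and num_nodes must be positive")
    x, y = start
    y += node
    while True:
        while y >= height:      # assumed wrap-around into the next column
            y -= height
            x += 1
        if x >= width:
            return
        yield x, y
        y += num_nodes
```

Under this rule the nodes visit disjoint sets of positions whose union is the
whole placement area, so no position is checked twice.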

With this implementation, the FCP processor defines only the initial position
for the working polygon, issues the check for overlap operation to the PPK
array and polls a status port from the PPK array to wait for the end of
computation. When one PPK node finds a feasible position, that position can
only be accepted by FCP if all the other nodes are beyond that position. In
this case, the PPK nodes still working are interrupted to abort their
operations, and the working polygon is frozen in that position and stored into
a PPK node determined by FCP. If one PPK node encounters one position but there
is at least one of the other nodes behind that position, the array must keep
the normal processing until that position is overtaken. This is necessary
because a better position may be found by the PPKs that are still working. The
management of the responses from the PPK nodes is done by the concentrator
circuit. This includes a custom controller that keeps track of the (X, Y)
coordinates currently present in each node and decides whether a feasible
solution found by one node may be accepted or not. Whenever a final position is
accepted, the concentrator sends an interrupt signal to all PPK nodes and
informs the FCP which one has found that position. This is necessary to
retrieve from that PPK the actual (X, Y) coordinates occupied by the working
polygon.
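
The acceptance rule applied by the concentrator can be expressed compactly in
software. In this sketch, positions are (x, y) tuples compared in raster order
(x first, then y), `found` maps node indices to the feasible positions already
reported and `scanning` maps the remaining nodes to their current positions;
the function name and these data structures are assumptions made for
illustration.

```python
def concentrator(found, scanning):
    """Return (node, position) if a reported position can be accepted.

    The earliest reported position in raster order is accepted only when
    every node still searching is already beyond it; otherwise None is
    returned and the array must keep working.
    """
    if not found:
        return None
    node, best = min(found.items(), key=lambda item: item[1])
    if all(position > best for position in scanning.values()):
        return node, best
    return None
```

For example, a position (3, 2) reported by one node is held back while another
node is still at (2, 9), because that node may yet find a position further to
the left.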


Table 4. Execution times for the new raster algorithm.

Number of PPK nodes       4       3       2       1
Execution time (sec)      12.8    16.9    25.3    50.6
Improvement (%)           74.7    66.7    50.1    -

The results presented in table 4 show that this implementation of the raster
algorithm enabled an optimal balance of the computation load among all the
processing nodes. The variation of processing times for different numbers of
processing nodes shows a linear increase of performance, measured as number of
solutions per unit of time.
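
The linear behaviour can be checked directly from the execution times of
table 4 (50.6, 25.3, 16.9 and 12.8 seconds for 1, 2, 3 and 4 PPK nodes). The
short script below computes the speedup and the parallel efficiency; the
variable and function names are illustrative.

```python
# Execution times in seconds, copied from table 4.
TABLE4_SECONDS = {1: 50.6, 2: 25.3, 3: 16.9, 4: 12.8}


def speedup_and_efficiency(times):
    """Map each node count n to (speedup, efficiency) relative to one node."""
    base = times[1]
    return {n: (base / t, base / (t * n)) for n, t in sorted(times.items())}


results = speedup_and_efficiency(TABLE4_SECONDS)
for n, (s, e) in results.items():
    print(f"{n} node(s): speedup {s:.2f}, efficiency {e:.3f}")
```

The efficiency stays above 0.98 for every configuration, which is consistent
with the linear increase of performance stated above.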


5  Conclusions  and  future  work 


This work presents the first results obtained with different approaches for
the parallelization of the 2D packing problem into a custom computing machine.
The best implementation obtained so far (section 4.4) has achieved, for a real
industrial benchmark, a linear performance with the number of processing nodes.
Although the current system only has 4 processing nodes and runs with a slow
10 MHz clock, it performs more than 10 times faster than a present day Pentium
processor running an equivalent software implementation. A new version of this
architecture based on last generation FPGA chips could easily reach a system
clock of 20 MHz. Using a Fafner system populated with 16 PPK nodes, the
processing times would be reduced to 1/8. Moving this architecture to a custom
integrated circuit technology would further increase the system clock, at the
cost of losing the reconfigurability.

Future developments will now be focused on the development and evaluation of
[...] neighbor solutions, and on tuning the control parameters of the
optimization meta-heuristics. Once the hardware architecture has been settled,
a new version of the Fafner system will be developed, either based on last
generation FPGA chips or custom integrated circuits. This [...] to make this
auxiliary processor an effective tool in accelerating optimization [...] for
the 2D packing problem, and to enable the complete automation of the cutting
plan phase in textile industries.

Fig. 2. The vectors g0) are calculated at each iteration of step 3 for N = 16 equations.

4  Evaluation 


The recursive decoupling algorithm has been implemented on the Fujitsu AP3000
distributed memory computer [13] using the message passing programming model.
We have used the MPI programming environment. To verify the performance of the
parallel algorithm, we used a test tridiagonal system (with known solution)
whose coefficient matrix satisfies the condition $|b_i| > |a_i| + |c_i|$,
$\forall i = 0, 1, \ldots, N-1$. This test is described below,

$$
\begin{pmatrix}
 2 & -1 &        &        &    \\
-1 &  2 & -1     &        &    \\
   & \ddots & \ddots & \ddots &    \\
   &        & -1     &  2     & -1 \\
   &        &        & -1     &  2
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{N-1} \\ u_N \end{pmatrix}
=
\begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}
\qquad (16)
$$

whose exact solution is an $N$-dimensional vector $u$ with components:

$$
u_i = \frac{N + 1 - i}{N + 1}, \quad \forall i = 1, \ldots, N.
\qquad (17)
$$
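
The closed-form solution (17) can be checked numerically with a few lines of
Python. The sketch below uses the classical sequential Thomas algorithm, not
the recursive decoupling algorithm studied here, and it assumes that the test
matrix has 2 on the diagonal, -1 on both off-diagonals and a right-hand side
equal to (1, 0, ..., 0), which is the right-hand side consistent with (17).
The function names are illustrative.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal
    (c[-1] unused), d: right-hand side. Returns the solution as a list.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    u = [0.0] * n
    u[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        u[i] = dp[i] - cp[i] * u[i + 1]
    return u


def exact_solution(n):
    """Components u_i = (N + 1 - i) / (N + 1) for i = 1..N."""
    return [(n + 1 - i) / (n + 1) for i in range(1, n + 1)]


N = 16
u = thomas([-1.0] * N, [2.0] * N, [-1.0] * N, [1.0] + [0.0] * (N - 1))
max_error = max(abs(x - y) for x, y in zip(u, exact_solution(N)))
```

For N = 16 the computed vector agrees with (17) up to rounding error, which
gives a simple reference against which a parallel solver can be validated.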


The experiments were performed on matrices of size ranging from 16384 (2^14)
to 1048576 (2^20) for the test (16). As we can see in Table 1, increasing the
number of processors reduces the execution time of the algorithm. We observe
that this method presents a high efficiency for all the system sizes.

Fig. 3 depicts the experimental results. In Fig. 3.a we show the efficiency
of the modified sequential algorithm we propose relative to the efficiency of
the initial algorithm. Observe that performance increases by more than 91% for
any value of N. On the other hand, in Fig. 3.b we show the efficiency of the
parallel algorithm for some values of parameter N. Efficiency was calculated
using the execution time of the sequential code. The parallel algorithm exceeds
the ideal speedup due to an efficient use of local memories and the
communication